We study a physical system which, while devoid of the complexity one usually associates with
proteins, nevertheless displays a remarkable array of protein-like properties. The constructive hy- 
pothesis that this striking resemblance is not accidental leads not only to a unified framework for 
understanding protein folding, amyloid formation and protein interactions but also has implications 
for natural selection. 
I. INTRODUCTION

The revolution in molecular biology sparked by the
discovery of the structure of the DNA molecule 50 
years ago has led to a breathtakingly beautiful descrip- 
tion of life. Life employs well-tailored chain molecules to 
store and replicate information, to carry out a dizzying 
array of functionalities and to provide a molecular basis 
for natural selection. The complementary base pairing 
mechanism in DNA combined with its double-helix struc- 
ture serves as a repository of information and provides 
an elegant mechanism for replication. The replication
is prone to errors or mutations and these errors, which 
are the basis of evolution, are in turn copied in future 
generations. Using the RNA molecule as an intermediary,
the information contained in the DNA genes is
translated into proteins, which are linear chains of amino 
acids. Unlike the DNA molecule, which adopts a limited 
number of related structures, protein molecules
fold into thousands of native state structures under phys- 
iological conditions. For proteins, form determines func- 
tionality and the rich variety of observed forms under- 
scores the versatility of proteins. There then follows a 
complex orchestrated dance in which proteins catalyze reactions,
interact with each other and finally feed back into
the gene to regulate the synthesis of other proteins.

A protein molecule is large and has many atoms. In ad- 
dition, the water molecules surrounding the protein play 
a crucial role in its behavior. At the microscopic level, 
the laws of quantum mechanics can be used to deduce the 
interactions but the number of degrees of freedom is far
too large for the system to be studied in all its detail.
When one attempts to look at the problem in a coarse-grained
manner with what one hopes are the essential
degrees of freedom, it is very hard to determine what the 
effective potential energies of interaction are. This sit- 
uation makes the protein problem particularly daunting 
and no solution has yet been found. 

Over many decades, much experimental data has been
accumulated yet theoretical progress has been somewhat
limited. The problem is highly interdisciplinary and 
touches on biology, chemistry and physics and it is of- 
ten hard to distill the essential features of each of the 
multiple aspects of the problem. The great successes of 
quantum chemistry in the determination of the structure 
of the DNA molecule and in the spectacular prediction 
that helices and sheets are the building blocks
of protein structures have spurred much work using detailed
chemistry on understanding the protein problem.
Such work has been very insightful in providing useful 
hints on how proteins behave at the atomic scale in per- 
forming their tasks. The missing feature, of course, in 
such a theoretical approach is that it treats each pro- 
tein as a special entity with all the attendant details of 
the sequence of amino acids, their intricate side chain 
atoms and the water molecules. Such an approach, while 
quite valuable, neither has as a goal nor can lend itself 
to a unified way of understanding seemingly disparate 
phenomena pertaining to proteins. Reinforcing this, ex- 
periments, which are very challenging, are carried out on 
one protein at a time and cry out for an understanding 
of the behavior of an individual class of protein. 

The lessons we have learned from physics are of a dif- 
ferent nature. The history of physics is replete with ex- 
amples of the elucidation of connections between what 
seem to be distinct phenomena and the development of 
a unifying framework, which, in turn, leads to new observable
consequences. There have been many
attempts at using physics-based approaches for under- 
standing proteins. These have provided valuable insights 
on how one might think about the problem and have 
served as a means of understanding experimental data. 
Yet, no simple unification has been achieved in a deeper 
understanding of the key principles at work in proteins. 

We restrict ourselves to globular proteins which dis- 
play the rich variety of native state structures. There 
are other interesting and important classes of proteins [13]
such as membrane proteins and fibrous proteins which we 
do not consider here. Our goal here is to present a new
approach to understanding proteins - our focus is on un- 
derstanding the origin of protein structures and how they 
form the basis for both functionality and natural selec- 
tion. Our work points to a unification of the various as- 
pects of all proteins: symmetry and geometry determine 
the limited menu of folded conformations that a protein 
can choose from for its native state structure; these struc- 
tures are in a marginally compact phase in the vicinity 
of a phase transition and are therefore eminently suited 
for biological function; these structures are the molecu- 
lar target for the powerful forces of evolution; proteins 
are well-designed sequences of amino acids which fit well 
into one of these predetermined folds; and proteins are 
prone to misfolding and aggregation leading to the for- 
mation of amyloids, which are implicated in debilitating 
human diseases such as Alzheimer's, light-chain
amyloidosis and spongiform encephalopathies. 

We present a discussion of the nature of the denatured 
state (which can loosely be thought of as the collection of 
unfolded conformations) and its possible key role in the 
protein folding problem. We also show how disordered 
proteins could fit into our unified framework. 

The problem of how life was created is a fascinat- 
ing one. Our focus is on looking at life on earth and 
asking how it works. The lessons we learn provide 
hints to the answers of deep and fundamental questions 
that have been pondered by our ancients: Was life on 
earth inevitable? Then there is the question posed by 
Henderson [16] about whether the nature of our physical
world is biocentric? Is there a need for fine-tuning in bio- 
chemistry to provide for the fitness of life in the cosmos 
or even less ambitiously for life here on earth? Surpris- 
ingly, as we will show, a physics approach turns out to 
be valuable for thinking about these questions. 

The main text of the paper contains the principal ideas 
and details of the calculations are relegated to the ap- 
pendices. In section II, we introduce the description of 
a protein as a thick polymer chain and highlight the dif- 
ferences in its phase diagram with respect to the usual 
string and beads model. In section III, we make a com- 
parison of the predictions obtained from the simple tube 
model against experimental data available on protein na- 
tive state structures. In section IV, we introduce a more 
refined model in which the tube picture is reinforced with 
the geometrical constraints that arise in the formation of 
hydrogen-bonds and discuss the resulting phase diagram 
for an isolated peptide chain. In section V, we discuss 
several consequences of our model including the nature of 
the free energy landscape, the innate propensity of pro- 
teins to aggregate into amyloid-like forms and the role 
played by proteins as the targets of natural selection in 
molecular evolution. In section VI, we discuss the nature 
of the denatured state of proteins and its possible role 
in protein folding. In the final section VII, we conclude 
with a summary. 
FIG. 1: (Color online) Schematic phase diagram for hard rods
highlighting the rich behaviour and the new (with respect to 
hard spheres) liquid crystal phases exhibited at intermediate 
temperatures. 



II. PHASES OF MATTER: FROM SPHERES TO 
TUBES 



The fluid and crystalline phases of matter can be readily
understood [17] in terms of the behavior of a simple
system of hard spheres. The standard way of ensuring 
the self-avoidance of a system of uniform hard spheres 
is to consider all pairs of spheres and require that their 
centers are no closer than their diameter. Studies of hard 
spheres have a venerable history [18] including early work
by Kepler on the packing of cannonballs in a ship's hold. 
Each hard sphere can be thought of as a point particle or 
a zero dimensional object with its own private space of 
spatial extent equal to its radius. Generalizing to a one 
dimensional object, one must consider a line or a string, 
with private space associated with each point along the 
line, leading to a uniform tube of radius of cross-section 
or thickness, A, with its axis defined by the line. (Like- 
wise, one could consider a collection of interacting tubes.) 
The generalization of the hard sphere constraint to the 
description of the self-avoidance of a tube of non-zero 
thickness is as follows [19] (see Appendix A): consider all
triplets of points along the axis of the tube. Draw circles 
through each of the triplets and ensure that none of the 
radii is less than the tube thickness. This prescription
surprisingly entails discarding pairwise interactions and
working with effective three body interactions [22].
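
The three-body prescription can be made concrete for a discrete set of points along the tube axis. The following is a minimal sketch and not the procedure of Appendix A: the function names `triplet_radius` and `is_valid_tube`, the brute-force loop over all triplets and the example coordinates are assumptions introduced here purely for illustration.

```python
import math
from itertools import combinations

def triplet_radius(p, q, r):
    """Radius of the circle through three points in 3D (infinite if collinear)."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(r, p)
    s = 0.5 * (a + b + c)
    # Heron's formula for the squared area; clamp round-off below zero.
    area_sq = max(s * (s - a) * (s - b) * (s - c), 0.0)
    if area_sq == 0.0:
        return math.inf
    return a * b * c / (4.0 * math.sqrt(area_sq))

def is_valid_tube(points, thickness):
    """True if no triplet of axis points has a radius below the tube thickness."""
    return all(
        triplet_radius(p, q, r) >= thickness
        for p, q, r in combinations(points, 3)
    )

# A right-angle bend with unit spacing: the only triplet has radius sqrt(2)/2.
bend = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(triplet_radius(*bend))
print(is_valid_tube(bend, 0.5))   # the bend is allowed for a thin tube
print(is_valid_tube(bend, 1.0))   # and forbidden for a thick one
```

A single radius test handles both sharp local bends (triplets of nearby points) and the close approach of distant parts of the chain (triplets with two nearby points and one remote point), which is why no separate pairwise check appears in the sketch.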

One may visualize a tube as the continuum limit of a 
discrete chain of tethered disks or coins [21] of fixed radius
separated from each other by a distance a in the
limit a → 0. The inherent anisotropy associated with
a coin (the heads to tails direction being different from 
the other two perpendicular to it) reflects the fact that 
there is a special local direction at each position defined 
by the locations of the adjacent objects along the chain. 
An alternative description of a discrete chain molecule 
is a string and beads model in which the tethered objects
are spheres. The key difference between these two
descriptions is the different symmetry of the tethered ob- 
jects. Upon compaction of a chain of spheres, each indi- 
vidual sphere tends to surround itself isotropically with 
other spheres unlike the tube situation in which nearby 
tube segments need to be placed parallel to each other. 
Even for unconstrained particles, deviations from spher- 
ical symmetry (replacing a system of hard spheres with 
one of hard rods, for example) lead to rich new liquid 
crystal phases (see Fig. 1). Likewise, we find that
the tube and a chain of tethered spheres exhibit quite dis- 
tinct behaviors with one exception - in the presence of an 
attractive self-interaction favoring compaction, the chain 
of coins and the string and beads model behave similarly 
in the limit of vanishing ratios of the radii of the coin 
and sphere to the range of attraction. A detailed com- 
parison between the chain of coins (tube) and the string 
and beads model with a bending rigidity energy term is 
carried out in Appendix B. 

Fig. 2 is a sketch of the phase diagram, at zero
temperature, of a homopolymer of length L and thick- 
ness A with the range of attractive interaction R. This 
phase diagram has been obtained using detailed com- 
puter simulations accompanied by an approximate mean 
field theory [22] and can be understood on the basis of
physical arguments. For large values of L/R, there are 
two distinct phases. When A/R is large, the tube is very 
thick compared to the range of attractive interactions 
and one obtains a swollen phase with equal weight for all 
self-avoiding conformations. One finds a very large de- 
generacy with no tendency towards compaction. On the 
other hand, for small A/R, one has a semi-crystalline 
phase [25] in which the tube is stretched out locally with
nearby sections parallel to each other. A similar struc- 
ture is also obtained for many long tubes - the arrange- 
ment is akin to piling up logs parallel to each other with 
each log surrounded by six other logs in a hexagonal array,
the optimal packing in two dimensions of coins of
radius A. Such structures are similar to those found in 
the Abrikosov flux lattice [26] and bear a resemblance to
liquid crystal order. 

Liquid crystals are a delicate state of matter of rod-like
molecules which adopt many distinct arrangements sensitive
to external electric and magnetic fields [23]. A liquid
crystal phase that is analogous to the semi-crystalline
phase is the nematic phase in which the molecules move 
as in a regular liquid but with an alignment of their axes. 
Unlike a spin system in which an up spin is different from 
a down spin, in the nematic phase, all that matters is the 
direction of the axis of the particle - there is no up-down 
distinction - and this change in symmetry leads to a first 
order phase transition between the disordered isotropic 
and the ordered nematic phases. Likewise, the phase 
transition between the semi-crystalline phase at low tem- 
peratures and a high temperature disordered phase in 
which there is no compaction of the tube is a first or- 
der transition as in the melting of ice into water. At the 
transition temperature, there is a coexistence of the two
phases (e.g., pieces of ice floating in a glass of water) and
an abrupt transition between the two states. One might
call such a system a two-state system - one has water
and/or ice but nothing in between.

FIG. 2: Sketch of the zero temperature phase diagram of a
tube in the continuum, subject to a self-attraction promoting
compaction. There are two phases when the tube length L is
long compared to the range of attractive interaction R. One
obtains the semi-crystalline phase (with parallel/anti-parallel
alignment between different stretches of the tube which then
fill the space with hexagonal symmetry, as depicted in the
figure) when the tube thickness A is small compared to R
and a swollen phase when A is large compared to R. There
are interesting finite size effects in the semi-crystalline phase.
In the thin tube limit, on decreasing the length there is a
crossover from the semi-crystalline phase with overall cylindrical
symmetry to a featureless compact phase with spherical
symmetry when L/R ~ (A/R)^{-2}. There is an unusual finite
size effect when A ~ R near the confluence of three phases
(at L = 2πR, A = R for a chain in the continuum): the
semi-crystalline phase, the featureless compact phase and the
swollen phase. A marginally compact phase is obtained in
this regime and displays a dramatic entropy reduction, with
the structure of choice being a helix with a well defined pitch to
radius ratio (see Fig. 4). Other structures such as hairpins
and sheets are present in the marginally compact phase for
discrete chains (see Fig. 3) [22].

When the tube is short, one would expect finite size 
effects [27] to come into play. In most physical systems,
such finite size effects are intuitively obvious corrections 
to the bulk scenario and arise from the effects of the finite 
boundaries. For our tube, the simplest situation occurs 
in the swollen phase where the finite size effects are not 
important - short fat tubes continue to adopt open con- 
formations. At the other extreme of small A/R, as one 
reduces the length of the tube, the overall symmetry of 
the folded object crosses over from that of a cylinder (cor- 
responding to the Abrikosov flux lattice-like phase akin 
to the hexagonal arrangement of parallel, straight logs)
to a sphere when L ~ R^3/A^2 and one obtains one out
of many degenerate featureless compact conformations. 
Physically, for a short tube, there are many more con- 
formations that can be accommodated in the spherical 
topology than in the cylindrical topology without any 
accompanying sacrifice in the attractive interaction en- 
ergy. 

There is a confluence of three distinct types of struc- 
tures: the swollen conformations, the semi-crystalline 
phase and the featureless compact conformations, when 
A ~ R ~ L (Fig. 2). This interplay leads to quite
remarkable finite size effects: one obtains a marginally 
compact phase with a huge reduction in the degener- 
acy compared to the featureless compact phase and the 
swollen phase. On raising the temperature, one again 
finds a two-state behavior and the finite size analog of 
a first order transition between the marginally compact 
phase and the disordered phase. The first order transi- 
tion occurs because it is necessary for different nearby 
tube segments to snap into position right alongside each 
other and parallel to each other in order to avail of the 
attraction. The inherent anisotropy of a tube along with 
the fact that A is of order R leads to this requirement. 
Such two-state behavior can, in the simplest scenario, be 
associated with a transition state along suitably chosen
reaction coordinates. The structures of choice [21, 29]
in the marginally compact phase, for a discrete chain, 
are helices, kissing hairpins, regular hairpins and sheets 
(Figures 3 and 4). Helices, hairpins and sheets are indeed
characterized by a parallel placement of nearby tube seg- 
ments. The marginally compact phase is poised in the 
vicinity of a phase transition to the swollen phase and 
the structures are therefore flexible [30] and sensitive to
the right types of perturbations. 



III. TUBES AND PROTEINS 

There is a truly remarkable coincidence between the 
structures one obtains in the marginally compact physi- 
cal state of matter of short tubes and the building blocks 
of protein native state structures (Fig. 3). Proteins
are linear chains of amino acids, of which there are twenty 
naturally occurring types with distinct side chains. The 
backbone and several of the side chains are hydropho- 
bic and, under physiological conditions, globular proteins 
fold rapidly and reproducibly to somewhat compact con- 
formations called their native state structures. In their 
native states, a hydrophobic core is created which is 
space-filling and water is expelled from the interior. Even 
though there are hundreds of thousands of proteins in 
human cells, the total number of distinct folds that they 
adopt in their native states is only of the order of a few 
thousand [31, 32, 33]. Furthermore, these structures seem
to be evolutionarily conserved. Proteins are
relatively short chain molecules and indeed longer globular
proteins form domains which fold autonomously.




FIG. 3: (Color online) Building blocks of biomolecules and
ground state structures associated with the marginally compact
phase of a short tube corresponding to a discrete chain
of tethered disks of radius A. The axis in the middle indicates
the direction along which the tube thickness A increases.
The top row shows some of the building blocks of
biomolecules, while the bottom row depicts the corresponding
structures obtained as the ground state conformations of
a short tube. (A1) is an α-helix of a naturally occurring
protein, while (A2) and (A3) are the helices obtained in our
calculations - (A2) has a regular contact map (i.e. a matrix
whose elements, corresponding to residue pairs, are either 0 or
1 depending on whether the two given residues are in contact
or not) whereas (A3) is a distorted helix in which the distance
between successive atoms along the helical axis is not
constant but has period 2. (B1) is a helix of strands in the
alkaline protease of Pseudomonas aeruginosa, whereas (B2)
shows the corresponding structure obtained in our computer
simulations. (C1) shows the "kissing" hairpins of RNA and
(C2) the corresponding conformation obtained in our simulations.
Finally (D1) and (D2) are two instances of quasi-planar
hairpins. The first structure is from the same protein as before
(the alkaline protease of Pseudomonas aeruginosa) while
the second is a typical conformation found in our simulations.
The sheet-like structure (D3) is obtained for a longer tube.
The biomolecular structures in the top
row are shown in the Cα representation for proteins, and in
the P representation for RNA kissing hairpins.



The building blocks of protein structures are helices, hairpins
and almost planar sheets (Fig. 3). Strikingly, short
tubes, with no heterogeneity, in the marginally compact
phase form helices with the same pitch to radius ratio
as in real proteins (Fig. 4) and almost planar sheets
made up of zig-zag strands. It is interesting to note that 
the helix is a very natural conformation for a tube and oc- 
curs without any explicit introduction of hydrogen bond- 
ing. Recent work on the denatured state of short amino 
acid sequences has suggested that the poly-proline II he- 
lix might be the preferred structure in that phase, even 
though it does not entail the formation of any hydrogen 
bonds. As in the tube case, small globular proteins
show a two-state behavior and recent
experiments have been successful in mapping
out the nature of the transition state in several cases.
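
The pitch to radius ratio of the optimal helix can be reproduced numerically from the criterion stated in the caption of Fig. 4: the radius of curvature equals half the distance of closest approach between different turns. The sketch below is only an illustration under stated assumptions and is not the calculation of Appendix C: the helix radius is set to one, the bracketing intervals are chosen by hand, and all function names are hypothetical.

```python
import math

def bisect(f, lo, hi, tol=1e-12):
    """Root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def curvature_radius(c):
    """Radius of curvature of a unit-radius helix with pitch to radius ratio c."""
    h = c / (2.0 * math.pi)  # axial advance per radian
    return 1.0 + h * h

def closest_turn_distance(c):
    """Distance from a point on the helix to the nearest point on the next turn."""
    h2 = (c / (2.0 * math.pi)) ** 2
    # Squared distance between parameters 0 and t: 2(1 - cos t) + h2 * t^2.
    # Its non-trivial local minimum satisfies sin t + h2 * t = 0 and lies in
    # (3*pi/2, 2*pi) for the range of c bracketed below (assumed, c < 2.8).
    t_star = bisect(lambda t: math.sin(t) + h2 * t, 1.5 * math.pi, 2.0 * math.pi)
    return math.sqrt(2.0 * (1.0 - math.cos(t_star)) + h2 * t_star ** 2)

def optimal_pitch_to_radius():
    """Ratio c at which the local and non-local thickness limits coincide."""
    mismatch = lambda c: 0.5 * closest_turn_distance(c) - curvature_radius(c)
    return bisect(mismatch, 2.0, 2.8)

c_star = optimal_pitch_to_radius()
print(f"optimal pitch to radius ratio: {c_star:.3f}")
```

Within the assumed bracket the mismatch changes sign exactly once, so plain bisection is sufficient and no external numerical library is needed.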

FIG. 4: (Color online) (a) Space filling optimal helix, with
a pitch to radius ratio c* ≈ 2.512 (drawn using Mathematica).
As explained in Appendix C, this optimal value is determined
by requiring that the radius of curvature of the helical
curve is equal to half the minimum distance of closest
approach between different turns of the helix. The corresponding
tube (that can be thought of as being inflated uniformly
around the curve) is optimally space filling since it
stops growing when reaching its maximum thickness both locally
(the radius of curvature) and non locally (half the minimum
distance of closest approach between different turns) at
the same time (see Appendices A and C). Such an optimality
criterion is shared by some of the conformations selected as
ground states in our simulations in the marginally compact
phase such as helices or planar hairpins and sheets shown in
Fig. 3 when it is properly translated for the case of a discrete
chain (see below and eq. (A5) in Appendix A). It can
be shown [21] that the planarity of hairpins and sheets is a
consequence of this optimal space-filling criterion. The same
geometrical feature is strikingly found to hold, within 3%,
for α-helices occurring in the native state of natural proteins.
(b) Plot of the ratio $f_i = \rho_{NL}(i)/\rho_{L}(i)$ of the non-local
radius of curvature $\rho_{NL}(i) = \min_{j<k} r(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k)$
(with $\{j,k\} \neq \{i-2,i-1\}, \{i-1,i+1\}, \{i+1,i+2\}$) over the radius of
curvature $\rho_{L}(i) = r(\mathbf{r}_{i-1}, \mathbf{r}_i, \mathbf{r}_{i+1})$ as a function of the residue
index i for the native state structure of sperm whale myoglobin
(PDB code 1mbn), where $\mathbf{r}_i$ refers to the spatial coordinates
of the Cα atom of the i-th residue, 1 < i < 153 (see Appendix A
for the definition of the triplet radius $r(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k)$).
In correspondence with the 8 α-helices present in the myoglobin
fold, shown as the solid (red) parts in the plot, the
values of $f_i$ oscillate around unity, demonstrating that helices
in natural proteins are optimally space filling in the sense described
above.

Let us make the constructive hypothesis that the extraordinary
similarity between the structures adopted by
short tubes in the marginally compact phase and the 
building blocks of protein native state structures is not 
a mere coincidence. We postulate instead that the tube 
picture presented above is a paradigm for understanding 
protein structures. Quite generally, such postulates are 
of limited utility unless one is able to unify seemingly 
unrelated aspects of the problem and make new predictions
amenable to experimental verification. In our case,
while the tube idea is theoretical, there is a wealth of
experimental data already available on proteins. Before
we proceed to explore the consequences of our hypothesis,
we will first link the tube picture with the protein
problem using experiments as the guide.

FIG. 5: Histogram of local thicknesses computed for all
residues of different protein native structures, when the virtual
chain formed by the backbone Cα atoms is viewed as a
discretized thick tube. At a given residue the local thickness
is simply the minimum triplet radius over all triplets containing
that residue (see Appendix A for the definition of triplet
radius and for an explanation of how such a quantity arises
within a tube description).

Let us begin by asking whether the backbone of a protein
can be described as a tube. Fig. 5 indeed shows
that, in its native state, the protein backbone can be
thought of as the axis of a tube of approximate radius of
cross-section (A) equal to 2.7 Å. Interestingly, there are
small variations in the tube radius especially in the vicinity
of backward bends. The tuning of the two length
scales, A and R, to be comparable to each other happens 
automatically for proteins: the sizes of the amino acid 
side chains determine both the tube thickness and the 
range of interactions. Steric interactions lead to a vast 
thinning of the phase space that protein structures can
explore. Physically, the notion of a thick chain or
a tube follows directly from steric interactions in a pro- 
tein - one needs room around the backbone to house the 
amino acid side chains without any overlap. The same 
side chains that determine the tube thickness also con- 
trol the range of attraction - the outer atoms of the side 
chain interact through a short-range interaction screened 
by the water. This self-tuning is a quite remarkable fea- 
ture of proteins. 
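
The local thickness used in Fig. 5 (the minimum triplet radius over all triplets containing a given residue) is straightforward to evaluate for any Cα trace. As a hedged illustration, the sketch below applies it to an idealized α-helix rather than to Protein Data Bank coordinates. The helix parameters (Cα radius 2.3 Å, rise 1.5 Å and rotation 100° per residue) are textbook-style values assumed here and are not taken from the analysis behind Fig. 5; the function names are likewise hypothetical.

```python
import math
from itertools import combinations

def triplet_radius(p, q, r):
    """Radius of the circle through three points in 3D (infinite if collinear)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(r, p)
    s = 0.5 * (a + b + c)
    area_sq = max(s * (s - a) * (s - b) * (s - c), 0.0)
    return math.inf if area_sq == 0.0 else a * b * c / (4.0 * math.sqrt(area_sq))

def local_thickness(points):
    """For each node, the minimum triplet radius over all triplets containing it."""
    thickness = [math.inf] * len(points)
    for i, j, k in combinations(range(len(points)), 3):
        radius = triplet_radius(points[i], points[j], points[k])
        for node in (i, j, k):
            thickness[node] = min(thickness[node], radius)
    return thickness

def ideal_alpha_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha positions of an idealized alpha-helix (assumed parameters, in angstroms)."""
    turn = math.radians(turn_deg)
    return [
        (radius * math.cos(k * turn), radius * math.sin(k * turn), rise * k)
        for k in range(n)
    ]

helix = ideal_alpha_helix(12)
thickness = local_thickness(helix)
print([round(t, 2) for t in thickness])
```

For this idealized geometry the thickness comes out close to the 2.7 Å quoted above. The loop over all triplets scales as the cube of the chain length, which is acceptable for a single domain but would call for a neighbor list in a survey of many structures.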

The rapid folding of small proteins can be understood 
in terms of the inherent anisotropy of a tube and the 
self-tuning of the two key length scales, the tube thick- 
ness and the range of the attractive interactions. In the 
marginally compact phase, in order to avail of the attractive
interactions, nearby segments of the tube have to
snap into place parallel to each other and right up against
each other. As stated before, both in the tube picture
and in proteins, the helix and the sheet are characterized
by such parallel space-filling alignment of nearby tube
segments. In proteins, such an arrangement serves to expel
the water from the protein core. As shown by Linus
Pauling and coworkers, hydrogen bonds provide the
scaffolding for both helices and sheets and place strong
geometrical constraints stemming from quantum chemistry.

FIG. 6: Sketch of the local coordinate system. For each Cα
atom i (except the first and the last one), the axes of a right-handed
local coordinate system are defined as follows. The
tangent vector t_i is parallel to the segment joining i − 1 with
i + 1. The normal vector n_i joins i to the center of the circle
passing through i − 1, i, and i + 1 and it is perpendicular to
t_i. t_i and n_i along with the three contiguous Cα atoms lie in
a plane shown in the figure. The binormal vector b_i is perpendicular
to this plane. The vectors t_i, n_i, b_i are normalized
to unit length.



IV. BEYOND THE TUBE ARCHETYPE: A 
REFINED TUBE MODEL INFORMED BY 
PROTEIN DATA 




FIG. 7: (Color online) Phase diagram of ground state conformations.
The ground state conformations were obtained by
means of Monte-Carlo simulations of chains of 24 Cα atoms.
e_R and e_W denote the local radius of curvature energy penalty
and the solvent mediated interaction energy respectively (see
Appendix E). Over 600 distinct local minima were obtained in
our simulations in different parts of parameter space starting
from a randomly generated initial conformation. The temperature
is set initially at a high value and then decreased gradually
to zero. (a), (b), (c), (e), (f), (g), (h) are the Molscript
representations of the ground state conformations which are
found in different parts of the parameter space as indicated
by the arrows. The helices and strands are assigned when local
or non-local hydrogen bonds are formed according to the
rules described in Appendix E. Conformations (i), (j), (k), (l),
(m) are competitive local minima. In the orange phase, the
ground state is a 2-stranded β-hairpin (not shown). Two distinct
topologies of a 3-stranded β-sheet (dark and light blue
phases) are found corresponding to
conformations (b) and (c) respectively. The white region in
the left of the phase diagram has large attractive values of
e_W and the ground state conformations are compact globular
structures with a crystalline order induced by hard sphere
packing considerations [54] and not by hydrogen bonding
(conformation (d)).



We turn now to a marriage of the tube idea and the 
wealth of information available from a variety of experimental
probes in preparation for the task of exploring
the consequences of our hypothesis. Recall that three
body local and non-local radii constraints describe the
self-avoidance of a tube (see Appendix A). For a discrete
chain, the local three body radius is defined as the
radius of a circle drawn through three consecutive nodes 
of the chain (in the limit of a continuous chain the local 
three body radius is equal to the radius of curvature). 
The non-local radius at a given node is defined to be the 
smallest among all the radii of circles drawn through that 
node and all pairs of other nodes except for its adjacent 
nodes (see also the caption of Fig. 4(b)). Unlike unconstrained
matter for which pairwise interactions suffice,
for a chain molecule, it is necessary to define the context
of the object that is part of the chain. This is most easily
carried out by defining a local Cartesian coordinate
system (see Fig. 6) whose three axes are defined by the
tangent to the chain at that point, the normal, and the 
binormal which is perpendicular to both the other two 
vectors. A study [52] of the experimentally determined
native state structures of proteins from the Protein Data
Bank reveals that there are clear amino acid aspecific
geometrical constraints on the relative orientation of the
local coordinate systems due to sterics and also associated
with amino acids which form hydrogen bonds with
each other (see Appendix D).
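
The local coordinate system can be built directly from three consecutive chain positions. The sketch below is one possible construction, given as an assumption rather than as the code used for the study: it takes the tangent along the segment joining the two neighbors and obtains the normal by removing the tangential component from the vector pointing toward the midpoint of that segment. For equal bond lengths, as for consecutive Cα atoms, this direction coincides with the one from the central atom to the center of the circle through the three atoms. All helper names are hypothetical.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    if norm == 0.0:
        raise ValueError("zero vector: the three points are collinear or coincide")
    return tuple(a / norm for a in u)

def local_frame(prev_pt, pt, next_pt):
    """Right-handed (tangent, normal, binormal) frame at the middle point."""
    tangent = unit(sub(next_pt, prev_pt))
    # Points from pt toward the midpoint of the segment prev_pt-next_pt.
    inward = add(sub(prev_pt, pt), sub(next_pt, pt))
    # Remove any tangential component so the normal is exactly perpendicular.
    inward = sub(inward, tuple(dot(inward, tangent) * a for a in tangent))
    normal = unit(inward)
    binormal = cross(tangent, normal)
    return tangent, normal, binormal

# Three points on the unit circle centered at the origin in the xy-plane.
t, n, b = local_frame((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))
print(t, n, b)
```

In the example the normal points from the middle point toward the origin, the center of the circle, and the binormal is the unit vector along z. The relative orientation of two residues can then be described by expressing one frame in the basis of the other.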

Recently we have carried out Monte Carlo simulations
of short homopolymers, chains made up of just
one type of amino acid, subject to these geometrical constraints
and physically motivated interaction energies: a
local bending energy penalty, e_R, an overall hydrophobicity,
e_W, and effective hydrogen bond energies (see Appendix E
for details about the refined tube model and
the simulations). The resulting phase diagram and the 
associated structures for short homopolymers of length 
24 are depicted in Fig. 7. In keeping with the behavior
of the archetype tube discussed earlier, in the vicinity of 
the swollen phase, one obtains distinct assembled tertiary 
structures, quite akin to real protein structures, on mak- 
ing small changes in the interaction parameters. The 
striking similarity between the observed structures and 
real protein structures suggests that our model captures 
the essential ingredients responsible for the limited menu 
of protein native structures. 
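
For readers unfamiliar with this type of simulation, the
acceptance step can be sketched as follows. This is a minimal
illustration that assumes the standard Metropolis rule (the
criterion named in the caption of Fig. 9); the energy function is
a toy placeholder whose arguments (counts of penalized bends,
hydrophobic contacts and hydrogen bonds, and the parameter e_hb)
are hypothetical and do not reproduce the refined tube model of
Appendix E.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Return True if a trial move with energy change delta_e is accepted."""
    if delta_e <= 0.0:
        return True  # downhill moves are always accepted
    # Uphill moves are accepted with the Boltzmann probability.
    return rng.random() < math.exp(-delta_e / temperature)


def toy_energy(n_bends, n_contacts, n_hbonds, e_r, e_w, e_hb):
    """Placeholder energy: bending penalty + hydrophobic + hydrogen bond terms.

    Assumption: each term is a simple count times its energy scale; the
    actual definitions of the three terms are given in Appendix E.
    """
    return e_r * n_bends + e_w * n_contacts + e_hb * n_hbonds
```

In a full simulation, a trial conformation that violates any of
the geometrical constraints would be rejected outright before this
energy test is applied.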

The marginally compact phase has distinct structures
including a single helix, a bundle of two helices, a helix
formed by β-strands, a β-hairpin, three-stranded β-sheets
with two distinct topologies and a β-barrel-like
conformation. These structures are the stable ground
states in different parts of the phase diagram. Furthermore,
conformations such as the β-α-β motif are found
to be competitive local minima. The specific structure
depends on the precise values of the local radius of curvature
penalty (a large penalty forbids tight turns associated
with helices, resulting in an advantage for sheet
formation) and the strength of the hydrophobic interactions
(a stronger overall attraction leads to somewhat
more compact well-assembled tertiary structures). The
topology of the phase diagram allows for the possibility
of conformational switching leading to the conversion of
an α-helix to a β-topology on changing the hydrophobicity
parameter, analogous to the influence of denaturants
or alcohol in experiments [55].



V. CONSEQUENCES OF THE PROTEIN-TUBE 
HYPOTHESIS 

We now turn to a study of some of the consequences 
of our postulate that the tube is a useful paradigm for 
understanding protein structures and behavior. We will 
benchmark these against experimental evidence to assess 
their validity. 



A. Energy landscape of proteins 

There have been many previous studies of proteins
from a physics point of view [56]. The standard approach
is to assume an overall attractive short range potential
which serves to lead to a compact conformation of the
chain in its ground state. In the absence of amino acid
specificity or when one deals with a homopolymer, there
is a huge number of highly degenerate ground states comprising
all maximally compact conformations with high
barriers between them (see Fig. 8(a)). The ground state
degeneracy and the height of the barriers grow exponentially
with the length of the homopolymer. The role
played by sequence heterogeneity is to break the degeneracy
of maximally compact conformations, leading to
a unique ground state conformation which, of course,
depends on the amino acid sequence. Yet, for a typical
random sequence, the energy landscape is still very
rugged and is virtually the same as in Fig. 8(a). A model
protein moving in such a rugged landscape can be subject
to trapping in local minima and may not be able
to fold rapidly, so that glassy behavior may ensue due
to such trapping. Bryngelson and Wolynes [57] suggested
that there is a principle of minimal frustration at work
for well-designed sequences in which there is a nice fit
between a given sequence and its native state structure,
carving out a funnel-like landscape [58] which promotes
rapid folding and avoids the glassy behavior (Fig. 8(c)).

Panel titles: Homopolymer (maximally compact); Homopolymer
(marginally compact).

FIG. 8: Simplified one dimensional sketches of energy landscape.
The quantity plotted on the horizontal axis schematically
represents a distance between different conformations
in the phase space and the barriers in the plots indicate the
energy needed by the chain in order to travel between two
neighboring local minima. (a) Rugged energy landscape for
a homopolymer chain with an attractive potential promoting
compaction as, e.g., in a string and beads model. There
are many distinct maximally compact ground state conformations
with roughly the same energy, separated by high energy
barriers (the degeneracy of ground state energies would be
exact in the case of both lattice models and off-lattice models
with discontinuous square-well potentials). (b) Pre-sculpted
energy landscape for a homopolymer chain in the marginally
compact phase. The number of minima is greatly reduced
and the width of their basin increased by the introduction
of geometrical constraints. (c) Funnel energy landscape for
a protein sequence. As folding proceeds from the top to the
bottom of the funnel, its width, a measure of the entropy
of the chain, decreases cooperatively with the energy gain.
Such a distinctive feature, crucial for fast and reproducible
folding, arises from careful sequence design in models whose
homopolymer energy landscape is similar to (a). In contrast,
funnel-like properties already result from considerations of
geometry and symmetry in the marginally compact phase (b),
thereby making the goals of the design procedure the relatively
easy task of stabilization of one of the pre-sculpted
funnels followed by the more refined task of fine-tuning the
putative interactions of the protein with other proteins and
ligands.

Indeed, given a sequence of amino acids, with all the 
attendant details of the side chains and the surrounding 
water, one obtains a funnel-like landscape with the min- 
imum corresponding to its native state structure. Each 
protein is characterized by its own landscape. In this 
scenario, the protein sequence is all-important and the 
protein folding problem, besides becoming tremendously 
complex, needs to be attacked on a protein-by-protein 
basis. 

In contrast, our model calculations show that the large
number of common attributes of globular proteins [29, 59]
reflects a deeper underlying unity in their behavior. At
odds with conventional belief, a consequence of our hypothesis
is that the gross features of the energy landscape
of proteins result from the amino acid aspecific common
features of all proteins. This landscape is (pre)sculpted by
general considerations of geometry and symmetry (Fig.
8(b)). Our unified framework suggests that the protein
energy landscape ought to have thousands of broad min- 
ima corresponding to putative native state structures. 
The key point is that for each of these minima the desir- 
able funnel-like behavior is already achieved at the ho- 
mopolymer level in the marginally compact part of the 
phase diagram (see Fig. 0). The self-tuning of two key 
length scales, the thickness of the tube and the interac- 
tion range, to be comparable to each other and the inter- 
play of the three energy scales, hydrophobic, hydrogen 
bond, and bending energy, in such a way as to stabilize 
marginally compact structures, also provide the close co- 
operation between energy gain and entropy loss needed 
for the sculpting of a funnelled energy landscape. 

Recent work has shown that the rate of protein folding
is not too sensitive to large changes in the amino
acid sequence [60, 61], as long as the overall topology of
the folded structure is the same. Furthermore, mutational
studies 0,EE have shown that, in the simplest
cases, the structures of the transition states are also
similar in proteins with similar native state structures.

Sequence designj^ would favor the appropriate native
state structure over the other putative ground states,
leading to an energy landscape conducive to rapid and
reproducible folding of that particular protein. Nature
has a choice of 20 amino acids for the design of protein
sequences. A pre-sculpted landscape greatly facilitates
the design process. Indeed, within our model, we find
that a crude design scheme, which takes into account the
hydrophobic (propensity to be buried) and polar (desire
to be exposed to the water) character of the amino acids,
is sufficient to carry out a successful design of sequences
with one or the other of the structures shown in Fig.
The matching of the hydrophobic profile of the designed
sequence to the burial profile [6^ (as measured by the
number of neighbors within the range of the hydrophobic
interaction) leads to the correct fold in a Monte Carlo
simulation. As examples, the sequence
HPPHHPHHPPPPPPHHPHHPPPPP, with e_R = 0.3 uniformly for
all residues, e_W = -0.4 for contacts between H and H,
and e_W = for other contacts, has as its ground state
the two-helix bundle structure (Fig. 0i) whereas
HPHHHPPPPHHPPHHPPPPHHHPP prefers the β-α-β motif
(Fig. ). It is interesting to note that the β-α-β motif
is only a local minimum in the phase diagram of a homopolymer
but is stabilized by the designed sequence.
Also, as is seen experimentally, many protein sequences
adopt the same native state conformation [64]. Once a
sequence has selected its native state structure, it is able
to tolerate a significant degree of mutability except at
certain key locations [45, 46, 62, 65]. Furthermore, multiple
protein functionalities can arise within the context
of a single fold|6r|.
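
The crude design scheme described above can be sketched in a few
lines. The code below is an illustration only, not the procedure
used to obtain the sequences quoted in the text: the function
names are hypothetical, the burial profile is assumed to be given
as a list of neighbor counts per residue, and the number of
hydrophobic residues is assumed to be chosen by the user. The
energy assigned to contacts other than H-H is left as an explicit
argument rather than given a default value.

```python
def design_hp_sequence(burial_counts, n_hydrophobic):
    """Place H at the n_hydrophobic most buried positions and P elsewhere."""
    # Rank positions from most to least buried (ties keep chain order).
    order = sorted(range(len(burial_counts)),
                   key=lambda i: burial_counts[i], reverse=True)
    buried = set(order[:n_hydrophobic])
    return "".join("H" if i in buried else "P"
                   for i in range(len(burial_counts)))


def hydrophobic_energy(sequence, contacts, e_hh, e_other):
    """Sum contact energies: e_hh for H-H pairs, e_other for all other pairs."""
    return sum(
        e_hh if sequence[i] == "H" and sequence[j] == "H" else e_other
        for i, j in contacts
    )


# The two-helix-bundle sequence quoted in the text has 24 residues.
TWO_HELIX_BUNDLE = "HPPHHPHHPPPPPPHHPHHPPPPP"
```

Matching the H positions to the most buried sites of a target
structure lowers the hydrophobic energy of that structure relative
to its competitors, which is all that is needed when the
competing minima are already few and broad.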

One of the successful methods of protein structure prediction
is based on threading [67]. The basic idea is entirely
consistent with our findings - one uses pieces of
native state structures of longer proteins as possible candidate
structures of a shorter protein - the technique
is simpler because instead of determining the structure
from ab-initio calculations, one merely has to select from
among the putative native state structures. The documented
success of the threading method confirms that
each protein does not fashion its own native state structure
but merely selects from the menu of pre-determined
folds.



B. Amyloid phase of proteins 

A range of human diseases such as Alzheimer's, spongiform
encephalopathies and light-chain amyloidosis lead
to degenerative conditions and involve the deposition of
plaque-like material in tissue arising from the aggregation
of proteins^L 113 EM EH- In the case of prions
[69], one observes a transition from α- to β-rich structures
which favors aggregation and causes bovine spongiform
encephalopathy (BSE) disease. It has been argued [70]
that the formation of amyloid fibrils occurs in a hierarchical
way starting from a chiral β-strand. The resulting
structures arise from a competition between the free energy
gain from the aggregation and the elastic energy cost
of the distortion. A variety of proteins not involved in
these diseases also form aggregates very similar to those
implicated in the diseased state 0,1^. This suggests [TEj
that the tendency for proteins to aggregate is a generic
property of polypeptide chains, with the specific sequence
of amino acids playing at best a secondary role. Can
one understand this general tendency of proteins to form
amyloids within our framework?

Let us recall the semi-crystalline polymer phase which
one obtains when the tube is sufficiently long (or when
there are many interacting tubes) and is subject to
attractive interactions leading to compaction. In this
phase, the tube is stretched out locally with nearby sections
parallel to each other (or has the tubes stacked parallel
to each other in a periodic arrangement) and does
not have the richness we associate with protein native
state structures. Returning to the protein, one may ask
whether there are structures which are the analogs of
those found in the semi-crystalline phase.

FIG. 9: (Color online) Aggregated structures formed by three
chains of length 12. We show the lowest energy conformations
obtained in long simulations for three 12-residue chains confined
within a cubic box of side L = 80 Å at T = 0.19 (a),
T = 0.18 (b) and T = 0.16 (c). The conformations shown
in (a'), (b') and (c') are the same as in (a), (b), and (c)
respectively, but viewed from a different angle. The parameters
used in the model are e_W = -0.08 and e_R = 0.2, which
correspond to having a single helix ground state in the case of
a single chain. The simulations start with random extended
conformations for all chains and are carried out with pivot
and crank-shaft moves that are accepted or rejected based on
the Metropolis criterion. Moves that bring the residues out
of the box are not allowed. The bundle of 3 helices (d) is
a putative ground state of the system and was obtained in
a simulation at a very low temperature (T = 0.05) starting
with isolated single helices. This conformation has the lowest
energy among those shown but is not the equilibrium conformation
at intermediate temperatures. Indeed, a simulation
run at T = 0.18 starting with conformation (d) leads to the
helix bundle being converted into the β-helix-like conformation
shown in (b), which is the dominant equilibrium conformation
at this temperature. (e) The specific heat as a function
of temperature for the system of three 12-residue peptides.
The data shown were obtained using the weighted histogram
technique based on long equilibrium simulations at various
temperatures between 0.16 and 4. The small shoulder
(I) corresponds to a condensation of separated peptides into
a disordered globule. The large peak (II) corresponds to a
transition from disordered globule to the β-helix-like phase.
(f) The energy as a function of time (in Monte Carlo steps)
during a long simulation at a temperature corresponding to
the maximum of the specific heat, T = 0.195. The simulation
shows several transitions between the disordered globule
phase and the β-helix-like phase.



FIG. 10: (Color online) Aggregated structures formed by five
and ten chains of length 12 with e_W = -0.08, e_R = 0.2. We
show the lowest energy conformations obtained in long simulations
for five chains at T = 0.18 (a), and for ten chains
at T = 0.2 (b). The five chain system is confined within a
cubic box of side L = 80 Å whereas the ten chain system is
confined within a cubic box of side L = 100 Å. The conformations
shown in (a') and (b') are the same as those in (a)
and (b) but viewed from a different angle.



In order to assess the role played by the interaction 
between multiple short proteins, let us first consider 
our model homopolymer chain made up of 36 identical 
amino acids in the marginally compact phase of the re- 
fined tube model (see Section IV and Appendix E) with 
the hydrophobic parameter and the local bending energy 
penalty chosen so that the ground state is a single long 
helix. 

On making two incisions in the chain to create three
distinct chains each containing 12 amino acids, the
ground state of the system appears to be a bundle of three
helices (see Fig. 9(d)). This helix structure, however, is
stable only at very low temperatures. At intermediate
temperatures, close to but lower than the temperature of
the specific heat peak, it is destabilized in favor of aggregated
β-helices (Figs. 9(a) and 9(b)) or sandwiches
of β-sheets (Fig. 9(c)), due to entropic effects. Cutting
a single chain into parts increases the entropy of the
system. Unbonded chains are more flexible and this promotes
the formation of interchain hydrogen bonds. The
β-sheet structures also show an increased flexibility compared
to the helix bundle, and they have better kinetic
accessibility from a disordered globule. While the appearance
of β-sheet conformations in the case of three
chains seems to have an entropic origin, it seems likely
that the ground state of a system of multiple chains does
in fact consist of aggregated β-sheets. Indeed, simulations
of 5 or 10 chains have shown that β-structures are
the most likely choice (see Fig. 10) [I .

The formation of β-sheet structured protein aggregates
is favoured with respect to other possible aggregates such
as helix bundles (which we actually detect in our simulations,
see Fig. 9(d)). In the latter case hydrogen bonds
are saturated within a single helix so that aggregation
is driven exclusively by the effective hydrophobic attraction
between different helices. On the other hand, for
structures such as those shown in Fig. 9(a-c), hydrogen
bonds are formed between different chains and several
β-strands are left unsaturated at both "ends" of the aggregate,
which can then readily grow by hydrogen-bonding
to other chains.

The refined tube model can be used to explore the free
energy landscape of a homopolymer chain in the vicinity
of its folding transition temperature, operationally defined
as the specific heat peak temperature. (Of course,
there is no real phase transition for finite size systems
such as proteins.) Fig. 11(a) is a contour plot of the free
energy at a temperature higher than the folding transition
temperature for the parameter values e_W = -0.08
and e_R = 0.3, for which the ground state is an α-helix.
The free energy landscape has just one minimum corresponding
to the denatured phase whose typical conformations
are still somewhat compact. The contour plot
at the folding transition temperature (Fig. 11(b)) has
three local minima corresponding to an α-helix, a
three-stranded β-sheet and the denatured state. At lower
temperatures, the α-helix is increasingly favored and the
β-sheet is never the global free energy minimum.
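
The effective free energy used in these contour plots is defined
in the caption of Fig. 11 as minus the logarithm of the histogram
of the number of hydrogen bonds and the number of hydrophobic
contacts. A minimal sketch of that bookkeeping is given below. It
is an illustration under stated assumptions, not the analysis code
used for the figure: the function names are hypothetical, the
samples are assumed to be (hydrogen bonds, contacts) pairs
recorded at constant temperature, and a "minimum" is taken to be a
bin that is lower than its four nearest occupied neighbors.

```python
import math
from collections import Counter


def effective_free_energy(samples):
    """Map each visited (n_hbonds, n_contacts) bin to F = -ln P, in units of k_B T."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {state: -math.log(n / total) for state, n in counts.items()}


def local_minima(free_energy):
    """Bins whose F is strictly lower than every nearest-neighbor bin.

    Unvisited bins are treated as having infinite free energy.
    """
    minima = []
    for (h, w), f in free_energy.items():
        neighbours = [(h + 1, w), (h - 1, w), (h, w + 1), (h, w - 1)]
        if all(f < free_energy.get(nb, math.inf) for nb in neighbours):
            minima.append((h, w))
    return sorted(minima)
```

With real simulation data the histogram is noisy, so some
smoothing or a coarser binning would normally be needed before
minima are identified in this way.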

That a β-sheet structure is a significant competitor
with a large basin of attraction in a region where the
stable phase is a helix (see Fig. 11) reinforces the possibility
that the interaction between several proteins could
stabilize the formation of extended hydrogen-bonded
β-sheets via the aggregation of individual chains (see |J3
for experimental evidence that the increased propensity
for extended single chain β-conformations as the temperature
is increased could indeed drive the formation of
β-aggregates). These kinds of structures, which resemble
the basic structures associated with amyloid fibrils, thus
seem to belong to the general class of pre-determined
folds, but this time for multiple proteins, and ought to
be seen ubiquitously in generic proteins |l5l IrjSf . This
suggests that the key to the prevention of such aggregates
is the stabilization of helices in such proteins and
evolutionary mechanisms such as proteasomes, molecular
chaperones [73] and ubiquitination enzymes [15, 68].

Our results show the generic tendency for multiple 
chains of amino acids to form aggregated amyloids rather 
than maintain their protein-like shape. Interestingly, na- 
ture has, on suitable occasions, thwarted the tendency of 
a single long chain to form amyloid by dividing the pro- 
tein into substantially independent domains which fold 
autonomously and are then assembled together. This 
suggests that the variety of protein folds increases with 
length up to a certain point at which they are supplanted 
by the formation of domains or amyloids. 

In a recent paper, Fandrich and Dobson [74] suggested
that "amyloid formation and protein folding represent
two fundamentally different ways of organizing polypeptides
into ordered conformations. Protein folding depends
critically on the presence of distinctive side chain
sequences and produces a unique globular fold. By contrast,
.... amyloid formation arises primarily from main
chain interactions that are, in some environments, overruled
by specific side chain contacts." Our results are in
complete accord with the suggestion that amyloid structures
may arise from the generic properties of the proteins,
with the details of the amino acid side chains playing
a secondary role. However, our work suggests that
instead of an "inverse side chain effect in amyloid structure
formation" [74], there is a unifying theme in the
behavior of proteins. Just as the class of cross-linked
β-structures is determined from geometrical considerations,
the menu of protein native state structures is also
determined by the common attributes of globular proteins:
the inherent anisotropy associated with a tube and
the geometrical constraints imposed by hydrogen bonds
and steric considerations.

FIG. 11: (Color online) Contour plots of the effective free energy
(a) at high temperature (T = 0.22) and (b) at the folding
transition temperature T_f = 0.2 for a single 24-residue
homopolymeric chain, with e_W = -0.08, e_R = 0.3. The effective
free energy, defined as F(N_l + N_nl, N_W) = -ln P(N_l +
N_nl, N_W), is obtained as a function of the total number of
hydrogen bonds N_l + N_nl and the total number of hydrophobic
contacts N_W from the histogram P(N_l + N_nl, N_W) collected
in equilibrium Monte Carlo simulations at constant temperature.
The spacing between consecutive levels in each contour
plot is 1 and corresponds to a free energy difference of k_B T,
where T is the temperature in physical units. The darker
the color, the lower the free energy value. There is just one
free energy minimum corresponding to the denatured state
at a temperature higher than the folding transition temperature
(Panel (a)) whereas one can discern the existence of
three distinct minima at the folding transition temperature
(Panel (b)). Typical conformations from each of the minima
are shown in the figure.



C. Natural selection and protein interactions 

Traditionally, the framework of evolution in life works 
through two aspects of organization called the genotype 
and the phenotype. The genotype is the heritable infor- 
mation encoded in the DNA, which is translated through 
the RNA molecules into proteins. The phenotype is valu- 
able for adaptation and at the molecular level plays a key 
role in natural selection. One conventionally assumes 
that there is a selection of phenotypes which leads to an 
enhancement in the numbers of the genotype. Further- 
more, mutations of the genotype lead to the possibility 
of new phenotypes. 

Let us consider the situation at two levels: the sequence
level (which is the genotype because it is a direct
translation from the evolving DNA molecules) and the
structure level, which we can think of as the phenotype.
As pointed out by Maynard-Smith[7J4, as the sequence
undergoes mutation, there must be a continuous network
that the mutated sequences can traverse without passing
through any intermediaries that are non-functioning.
Thus, one seeks a connected network in sequence space
for evolution by natural selection to occur. There is considerable
evidence, accumulated since the pioneering suggestion
of Kimura|2a| and King and Jukes [77], that much
of evolution is neutral. The experimental data strongly
support the view that the "random fixation of selectively
neutral or very slightly deleterious mutants occur
far more frequently in evolution than selective substitution
of definitely advantageous mutants" [78]. Also
"those mutant substitutions that disrupt less the existing
structure and function of a molecule (conservative
substitutions) occur more frequently in evolution than
more disruptive ones" [7^. Thus while one has a "random
walk" in sequence space that forms a connected network,
there is no similar continuous variation in structure
space [36, 79].
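
The requirement of a connected network in sequence space can be
made concrete with a small sketch. The code below is purely
illustrative and rests on assumptions not made in the text: the
set of functional sequences is supplied by hand (here as
hypothetical H/P strings), a mutation is taken to be a single
point substitution, and connectivity is tested by a breadth-first
search.

```python
from collections import deque


def hamming_one(a, b):
    """True if equal-length sequences a and b differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def is_connected_network(sequences):
    """True if every sequence can be reached from any other by a chain of
    single-point mutations that stays inside the given set."""
    seqs = list(set(sequences))
    if not seqs:
        return True
    seen = {seqs[0]}
    queue = deque([seqs[0]])
    while queue:
        current = queue.popleft()
        for other in seqs:
            if other not in seen and hamming_one(current, other):
                seen.add(other)
                queue.append(other)
    return len(seen) == len(seqs)
```

In this picture, the larger the set of sequences that fold into a
given structure, the easier it is for that set to remain
connected under point mutations.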

These facts are in accord with our result of a pre- 
sculpted energy landscape that is shared by all proteins 
and has thousands of local minima corresponding to pu- 
tative native state structures - not too few because that 
would not lead to sufficient diversity and not too many 
because that would lead to too rugged a landscape with 
little hope that a protein could fold reproducibly and 
rapidly into its native state structure. Indeed, many pro- 
teins share the same native state fold and often the mu- 
tation of one amino acid into another does not lead to 
radical changes in the native state structure, underscoring
the fact that it is not the details of the amino acid side
chains that sculpt the energy landscape but rather some 
overarching features of symmetry and geometry that are 
common to all proteins. In this respect, the phase of 
matter that comprises the native state structures is one 
that is possibly determined by physical law rather than 
by the plethora of microscopic details in analogy with the 
limited menu of possible crystal structures. 

Anfinsen^J wrote in 1972, "Biological function appears
to be more a correlate of macromolecular geometry
than of chemical detail." There has been much
recent progress in extracting information on biological
function and protein interactions [80] from the structure
of proteins and the complexes they form [81]. A protein
structure chosen from the predetermined menu of folds 
contains information on the topology of the folded state. 
Additionally, one can glean information on the nature of 
the exposed surface, crystal packing, and the existence 
of clefts or other geometrical features (which are often 
the active sites of enzymes). The picture is completed 
by knowledge of the sequence of amino acids that folds 
into the structure using which one can infer the amino 
acid composition of the exposed surfaces, the location 
of mutants and conserved residues and evolutionary rela- 
tionships. For some structural families, function is highly 
conserved, whereas for others, one can use the types of 
information described above to guess the function [82].

Biological reactions are accelerated by factors of more
than a billion by enzymatic proteins. Enzymes not
only provide for great catalytic efficiency but are also
extremely specific in their function. The principal
mechanism [83] underlying the tremendous enhancement
of the reaction rate by the enzymes is the lowering of the 
free energy of the transition state of the reaction through 
their specific binding to the substrate or the reactant(s). 
In its native state, an enzyme adopts a structure chosen 
from the menu of pre-determined folds. Strikingly, only 
a small part of this structure is important for the en- 
zymatic action. Generally, there are a few amino acids, 
associated with the active site, which are responsible for 
the catalytic activity. In close proximity, one also finds 
the substrate binding site which provides the specificity, 
often through the classic lock and key mechanism. 

An illustration of enzymatic action and the role of 
molecular evolution is provided by the protease family 
of proteins. In a living cell, there is turnover of proteins 
with new proteins being continually synthesized along 
with the degradation of existing proteins. Proteins re- 
sponsible for degradation through the hydrolysis of pep- 
tide bonds are called proteases. Under physiological con- 
ditions, peptide bonds are stable for a period of around 
a hundred years. The proteases are able to enhance the 
degradation rate selectively by factors of around a billion. 
There are several classes of proteases including serine
proteases (such as chymotrypsin, a digestive enzyme) with
a very reactive serine residue, cysteine proteases (such
as papain, which is a digestive enzyme derived from papaya)
with cysteine playing the role of serine, aspartyl
proteases (such as renin, which controls blood pressure),
which employ a pair of aspartate groups, and metalloproteases
(such as collagenase, responsible for collagen degradation
in osteoarthritic cartilage), which use a bound
metal ion such as zinc to accelerate the hydrolysis.
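
The orders of magnitude quoted above can be checked with a short
calculation. The sketch below is a back-of-the-envelope
illustration, not part of the original analysis: it takes the
round figures from the text (a lifetime of a hundred years and an
enhancement factor of a billion) and assumes that the rate depends
on the transition state free energy through a simple Boltzmann
factor, so that the lowering of the barrier in units of k_B T is
the logarithm of the rate ratio.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def catalysed_lifetime_seconds(uncatalysed_years, enhancement):
    """Lifetime of a bond once the reaction rate is multiplied by `enhancement`."""
    return uncatalysed_years * SECONDS_PER_YEAR / enhancement


def barrier_reduction_in_kT(enhancement):
    """Lowering of the transition state free energy, in units of k_B T,
    assuming rate ~ exp(-barrier / k_B T)."""
    return math.log(enhancement)


lifetime = catalysed_lifetime_seconds(100, 1e9)  # a few seconds
reduction = barrier_reduction_in_kT(1e9)         # about 21 k_B T
```

Under these assumptions a peptide bond that would survive for a
century is cleaved in roughly three seconds, which corresponds to
a barrier lowering of about 21 k_B T.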

In serine proteases, the catalytic triad comprises three 
amino acids, serine, histidine and aspartate, bound to 
each other through hydrogen bonds, whose presence leads 
to the proton being moved away from the serine and the 
creation of a reactive alkoxide ion. The same triad is 
implicated in all serine proteases. Indeed, an example of 
convergent evolution is provided by subtilisin (an enzyme 
that resembles chymotrypsin in its action and is made by 
certain soil bacteria) and its family members, which pos- 
sess the catalytic triad but have a quite different struc- 
ture from chymotrypsin. Here Nature uses different folds 
from the pre-sculpted energy landscape which, on appro- 
priate sequence design, have the same catalytic triad and 
perform similar tasks. 

The limited menu of possible protein folds provides 
a marvellous opportunity for divergent evolution. This 
corresponds to proteins whose native state structure and 
the catalytic triad are the same but with distinct dif- 
ferences in the nature of the binding site. The binding 
site in chymotrypsin is adjacent to the active site and is 
a hydrophobic cavity which facilitates hydrolysis of the 
peptide bonds on the carboxyl side of aromatic or large 
hydrophobic amino acids such as Trp, Tyr, Phe, Met 
and Leu. Relatively small changes in the amino acid sequence,
which maintain both the native state structure
and the active triad, lead to other proteins such as trypsin
(a digestive protein made in the pancreas which cleaves
after the positively charged amino acids lysine and arginine,
due to a change of one of the hydrophobic amino acids
in the binding cavity to a negatively charged aspartic
acid), elastase (a protein made both in the pancreas and
by white blood cells, in which two glycines in the binding
cavity are replaced by the much larger amino acids valine
and threonine, allowing the enzyme to specifically target
elastin, which is an important building block of blood vessel
walls and ligaments - elastase is able to cleave proteins
after glycine and alanine because of the small size of
the binding cavity), thrombin (a larger enzyme, the tail end
of which bears a significant similarity to the sequence
of amino acids of chymotrypsin and trypsin, and which cleaves
proteins only at arginine-glycine linkages; thrombin is
a complex regulatory protease which converts a usually
soluble blood protein, fibrinogen, into the insoluble fibrin,
causing a blood clot and the cessation of bleeding), plasmin
(an enzyme which cleaves proteins after lysine and
arginine and dissolves blood clots), cocoonase (which also
cleaves after lysine and arginine in the silk strands of the
cocoon after the transformation of a caterpillar into a silk
moth) and acrosin (an enzyme which plays a pivotal role
in fertilization by creating a hole in the protective sheath
around the egg and allowing sperm-egg contact).
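
The cleavage specificities listed above can be summarized as a
simple lookup, sketched below. This is a deliberately idealized
illustration: the function and table names are hypothetical, only
the three enzymes whose target residues are spelled out in the
text are included, residues are written in the standard one-letter
code, and every listed residue is assumed to be cleaved on its
carboxyl side with no exceptions, which real enzymes do not
strictly obey.

```python
# Residues after which each protease is taken to cleave (one-letter codes).
CLEAVE_AFTER = {
    "chymotrypsin": set("WYFML"),  # Trp, Tyr, Phe, Met, Leu
    "trypsin": set("KR"),          # Lys, Arg
    "elastase": set("GA"),         # Gly, Ala
}


def digest(peptide, protease):
    """Split a peptide string after every residue targeted by the protease."""
    residues = CLEAVE_AFTER[protease]
    fragments, start = [], 0
    for i, aa in enumerate(peptide):
        # No cut after the last residue, which would give an empty fragment.
        if aa in residues and i < len(peptide) - 1:
            fragments.append(peptide[start:i + 1])
            start = i + 1
    fragments.append(peptide[start:])
    return fragments
```

For example, the hypothetical peptide AKGFRS is cut into AK, GFR
and S by trypsin, but into AKGF and RS by chymotrypsin: the same
catalytic machinery with a different binding pocket yields a
different set of fragments.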

As we have seen, evolution along with natural selection
allows Nature to use variations on the same theme,
facilitated by the rich repertory of amino acids, to create
enzymes that are able to catalyze a remarkable array
of diverse and complex tasks in the living cell. The key 
point, of course, is that in order for molecular evolution 
to work in this manner, one needs the constant backdrop 
of folds shaped not by the sequence but determined by 
physical law. Were the folds not immutable and them- 
selves subject to Darwinian evolution, the possibility of 
creating so many subtle and wonderful variations on the 
same theme would not exist. The pre-sculpted landscape 
is the crucial feature that leads to a predetermined menu 
of immutable folds. 

It is known that key functional sites exhibit a high degree
of conservation [84]. Interestingly, co-evolutionary
analysis has been useful in identifying protein-protein
interactions [8^. Structural similarity, independent of
evolutionary homology, can be the key reason why proteins
with different folds share some commonality in enzymatic
activity or ligand binding [86]. Conversely, there
are protein structures such as the TIM barrel [87] which
are very versatile and are able to house proteins that are
able to carry out multiple functionalities. Even though
the proteins are able to perform diverse catalytic tasks,
Nagano et al.|s3 find that the active site is generally
found at the C-terminal end of the barrel sheets and
that there are "striking structural superpositions" of the
metal-ligating and catalytic residues.

Nooren and Thornton [88] have pointed out that "The
structure and affinity of a PPI (protein-protein interaction)
is tuned to its biological function and the physiological
environment and control mechanism. PPIs presumably
evolve to optimize 'functional' efficacy. This does
not necessarily involve strong interactions. Clearly, weak
transient interactions that are efficiently controlled are
also very important in cellular processes."

There are several attractive features of the picture we 
have developed based on the tube-protein hypothesis. 
First, protein structures lie in the vicinity of a phase 
transition to the swollen phase which confers on them 
exquisite sensitivity, especially in the exposed parts of 
the structure, to the effects of other proteins and ligands. 
The flexibility of different parts of the protein depends on
the number of constraints placed on them by the rest
of the protein [30]. From this point of view, it is easy to
understand how loops, which are not often stabilized by
backbone hydrogen bonds, can play a key role in protein
functionality.

It is useful to reconsider how nature uses the variety
of amino acids for sequence design. The existence of a
pre-sculpted energy landscape with broad minima
corresponding to the putative native state structures and the
existence of neutral evolution demonstrate that the
design of sequences that fit a given structure is relatively
easy, leading to many sequences that can fold into a given
structure. This freedom facilitates the accomplishment
of the next level task of evolution through natural
selection: the design of optimal sequences, which not only
fold into the desired native state structure, but also are
fit in the environment of other proteins. A useful protein
is one that can interact with other proteins in a
synergistic manner and at the same time is not subject to the
tendency to aggregate into the harmful amyloid form.
This suggests that protein engineering studies aimed at
improving enzymatic function ought to be carried out in
a two-step manner: first, the family of sequences that
fold into a desired target structure needs to be selected;
second, a finer design needs to be carried out in the context
of the substrates and the other proteins that the target
protein interacts with. Unlike the generality of geometry
and symmetry that leads to the menu of native state
folds, what we have here is a problem of chemistry acting
within the fixed background of the physically determined
structures. These considerations suggest that, when the
information becomes available, protein-protein interaction
networks [89] can be fruitfully viewed not only as the
interactions between proteins but also as the interactions
between the structures that house them.

The characteristics required for protein native state 
structures to be targets of an evolutionary process are 
stability and diversity. Stability is needed because one 
would not want to mutate away a DNA molecule able to 
code for a useful protein, and diversity, in order to al- 
low evolution to build complex and versatile forms. The 
mechanism for natural selection arises naturally in this 
context - DNA molecules that code for amino acid se- 
quences that fit well into one of these predetermined folds 
and have useful functionality thrive at the expense of 
molecules that create sequences that are not useful. In- 
deed, in this picture, sequences and functionality evolve 
in order to fit within the constraints of these folds, which, 
in turn, are immutable and determined by physical law. 



VI. THE DENATURED STATE OF PROTEINS 

Progress occurs in science through the use of construc- 
tive hypotheses with a careful assessment of their con- 
sequences. Experiments not only provide valuable hints 
for selecting between competing hypotheses but are also 
the ultimate test of a given hypothesis. There are strong 
hints from protein experiments that the protein-tube hy- 
pothesis is valid. It provides a unification of the various 
aspects of all proteins: one obtains a pre-sculpted energy 
landscape with relatively few folds, one can rationalize 
how a protein might fold in a cooperative manner into 
its native state conformation, there is the possibility of 
straightforward design of optimal sequences that fit into 
a desired structure, the structures are in a marginally 
compact phase in the vicinity of a phase transition and 
have the flexibility needed for biological function, and 
one can understand the formation of amyloids and the 
role played by the protein structures as a molecular basis 
for natural selection. 

Protein sequence design provides an optimal fit of the 
sequence with one among the menu of pre-sculpted con- 
formations. The question arises of course as to how a 
given sequence is able to reach its native state conforma- 
tion or its home starting from its denatured conforma- 
tion. The answer to this question entails the understand- 
ing of its denatured state EH U H H H |H . 
Unlike the native state which is a somewhat tightly 
bound set of marginally compact conformations, one en- 
visions the denatured state as an ensemble of somewhat 
open conformations that the protein adopts when it is 
not under physiological conditions. 

While one may naively think that the denatured state 
is devoid of any interesting features, recent work has un- 
derscored the possibility that the number of accessible 
conformations is severely reduced compared to a random
chainUl H HJ |H H, |H HI leading to biases in the 
chain direction that persist over the entire length of the 
protein. Indeed, Shortle [94] has argued that "long-range
structure, which cannot be removed by strongly
denaturing conditions, could arise predominantly from 
local steric hindrance." He goes on to state that "not 
only does the ribosome determine the primary structure 
of each protein it makes, it also establishes the topologi- 
cal space in which that protein chain will be confined for 
the rest of its existence." 

We build on these insights and the presumed validity of 
our protein-tube hypothesis by making a second hypothe- 
sis that just as there is a one-way correspondence between 
a sequence and its native state structure, there could ex- 
ist a similar correspondence between the sequence and its 
denatured state. In this view, the denatured state can
be thought of as an address of the native state conformation
and lies within its basin of attraction. 

Unlike the native state, the denatured state has a 
larger entropy and comprises somewhat open conforma- 
tions. Because of this, water plays a quite crucial role in 
the denatured state. Both the above factors lead to local 
interactions [97, 98] playing a more important role than
non-local interactions in the denatured state. As can be 
seen from Fig. 15(a), the local bending energy term is
amino acid specific. In addition, in the spirit of the tube 
model, one might ask whether there are extra geometri- 
cal constraints between the local frames of reference (see 
Fig. of neighboring amino acids along the chain. (As 
discussed earlier, at the non-local level, hydrogen bonds 
linking different parts of the chain do place geometrical 
constraints on the reference frames associated with these 
locations.) Physically, such correlations arise from the 
fact that, in addition to the Cα atom that we have
considered as a surrogate for the amino acid, all amino acids
but glycine have a Cβ atom to which the side chain is
attached. In a chain of coins (see Section II), this cor- 
responds to breaking the symmetry in the plane of the 
coin. Thus one would quite generally expect that side 
chain interactions would lead to correlations between the 
local coordinate frames of nearby amino acids along the 
sequence^. Remarkably, the local steric constraints^ 
and the hydrogen bonds [8, 9] act in concert and both
promote helices and sheets in the native state.

One can ask what the effects of such a local interaction
are in the absence of any non-local interaction promoting
the compaction of the chain. Let us first consider a
homopolymer made of just one kind of amino acid. A 
simple chain molecule with a local bending constraint 
leads to a tangent-tangent correlation, $\langle \mathbf{t}_i \cdot \mathbf{t}_{i+n} \rangle$ (see Fig.
El for the definition of the tangent vector), that decays 
exponentially in n. Adding a local binormal-binormal 
interaction term leads quite generally (see the example 
in Appendix F) to the tangent-tangent correlation de- 
caying exponentially with sequence separation but being 
modulated with oscillatory behavior. This generic behav- 
ior underscores the fact that the class of denatured con- 
formations are not merely featureless but rather already 
have short-range structure built into them. Indeed, there 
is a clear reduction in the entropy, due to the short-range 
binormal-binormal interaction, which is reflected in the 
oscillations. 
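
The exponential decay of the tangent-tangent correlation can be
checked numerically. The Python sketch below is an illustration only
and is not the model of Appendix F: it assumes a freely rotating chain,
i.e. a fixed bend angle between successive unit tangents and a
uniformly random torsion, which is one simple realization of a local
bending constraint. For such a chain the correlation at separation n
is known to equal (cos θ)^n. All function names and parameter values
are hypothetical choices made for this example.

```python
import math
import random


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    """Rescale a vector to unit length."""
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def chain_tangents(n_bonds, bend_angle, rng):
    """Unit tangents of a freely rotating chain (assumed model).

    Each new tangent makes the fixed angle `bend_angle` with the
    previous one; the torsion angle is drawn uniformly at random.
    """
    c, s = math.cos(bend_angle), math.sin(bend_angle)
    t = (0.0, 0.0, 1.0)
    tangents = [t]
    for _ in range(n_bonds - 1):
        # Orthonormal pair (u, v) spanning the plane perpendicular to t.
        helper = (1.0, 0.0, 0.0) if abs(t[0]) < 0.9 else (0.0, 1.0, 0.0)
        u = normalize(cross(t, helper))
        v = cross(t, u)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        t = normalize(tuple(
            c * t[k] + s * (math.cos(phi) * u[k] + math.sin(phi) * v[k])
            for k in range(3)))
        tangents.append(t)
    return tangents


def tangent_correlation(tangents, n):
    """Chain average of t_i . t_{i+n} for a separation n >= 1."""
    products = [dot(a, b) for a, b in zip(tangents, tangents[n:])]
    return sum(products) / len(products)


if __name__ == "__main__":
    rng = random.Random(0)
    theta = 0.5
    tangents = chain_tangents(20000, theta, rng)
    for n in range(1, 6):
        # Measured correlation next to the expected (cos theta)^n.
        print(n, tangent_correlation(tangents, n), math.cos(theta) ** n)
```

Because this sketch contains no binormal-binormal coupling, it shows
only the pure exponential decay; the oscillatory modulation discussed
above requires the additional interaction term.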

The situation is vastly more interesting when one con- 
siders a specific sequence of amino acids in its dena- 
tured state. It is clear that one ought to have amino 
acid specific correlations between neighboring coordinate 
frames which reflect the nature and size of the side chains. 
An amino acid like proline with its cumbersome side
chain configuration can lead to strong constraints in its
vicinity whereas glycine, which lacks the Cβ atom and a
side chain, can provide great flexibility [100] and act as
a joker in a card game. Even if such correlations
reflect small but systematic deviations from the average
behavior [101], these can build up in a very specific way
along the sequence leading to a clear imprinting of the
native state conformation even in the denatured state.
In this context, it is interesting to note that Shortle [94]
has shown that "denaturation by at least three different
agents - truncation, urea and acid - gives rise to
essentially the same persistent native-state like topology".
Furthermore, the alteration of the denatured state by
even a single mutation [102] provides further evidence for
the structure inherent in the denatured state.

We have shown that the menu of native state struc- 
tures is determined from generic considerations. Se- 
quence specificity is key in determining whether a given 
sequence fits particularly well into one of these conforma- 
tions. Because the menu is large (thousands of confor- 
mations), one has diversity. However, because the menu 
is not too large, a well-designed sequence is able to fold 
rapidly into its native state conformation. Our hypothe- 
sis is that local sequence-specific interactions alone lead 
to a denatured state which is a reflection of the native 
state. The denatured state lies in the basin of attrac- 
tion of the native state and the folding process simply 
entails the action of the appropriate non-local interac- 
tions in leading to the protein adopting the native state 
conformation. 

The situation is somewhat reminiscent of a content-addressable
memory [103] in which partial information
is converted by the brain to recover the complete
information. Such content addressable memories [103]
as well as the energy landscape [104] suitable for
prebiotic evolution [105] have been modeled through spin
glasses [106]. The energy landscape of spin glasses is also
characterized by diversity and stability arising from
randomness and frustration, which is quite distinct from
the physical mechanisms of short tubes in the marginally
compact phase. In conventional spin glasses, randomness,
which plays a role somewhat similar to amino acid
specific interactions in proteins [107], through frustration
sculpts an energy landscape with many local minima.
Indeed, a non-random exchange interaction between spins
would lead to periodic order with much simpler behavior.
In spin glasses, starting from a random spin configuration,
it is hard to reach a specific local minimum unless
the exchange constants are tuned in a clever way as in
a content addressable memory. The landscape is not
invariant on changing the exchange interactions and can be
fashioned at will. For proteins, on the other hand, our
analysis shows that a rich landscape is obtained even in
the absence of any sequence heterogeneity and the
nature of the ground states is determined by geometry and
symmetry and is therefore immutable [35].

An interesting consequence of the type of denatured 
state described above along with the existence of the 
pre-sculpted landscape is the possibility of disordered 
proteins [108] - sequences that are in temporally fluctuating
denatured form but which fold in the presence of
distinct substrates to carry out vital multiple function- 
alities. In our picture, these sequences need appropriate 
stabilizing influences to fold. In the absence of these in- 
fluences (substrates), the protein is denatured and is lo- 
cated, colloquially, on the fence between different native 
state structures. Given that finite size effects are severe 
for proteins, the presence of different substrates (leading 
to different boundary conditions) would not only favor 
one competing structure over the others but also result 
in folding to that structure. The simultaneous existence 
of the distinct folds in the energy landscape allows the 
protein to choose from among them depending on the 
precise nature of the stabilizing influence. 



VII. SUMMARY AND PERSPECTIVE 

Symmetry and geometry place strong constraints on 
the types of infinite sized crystal structures and there are
exactly 230 distinct space groups in 3 dimensions [109].
Proteins are finite sized objects. Our analysis demon- 
strates that the same kind of symmetry and geometrical 
considerations lead to a finite number of protein folds. 
This number grows with the size of the protein but is 
limited by the fact that proteins beyond a characteris- 
tic length either form autonomous domains or amyloids. 
Unlike the crystalline state of matter, proteins are
characterized by an inherent anisotropy due to their
tube-like character. A given crystalline structure transcends
the material that is housed in it - common salt adopts
the face-centered-cubic lattice structure, as do the
well-packed cannonballs of Kepler [18]. Likewise, different
sequences of proteins can be housed in the same protein fold
and yet be able to perform different functionalities!^ . 
Protein structures are modular in form being simple as- 
semblages of helices and strands connected by tight turns. 

The unified picture leads to a single free energy land- 
scape with two distinct classes of structures. The amyloid 
phase is dominated by β-strands linked to each other in a
variety of forms whereas the native state structure menu
is an assembly of α-helices and β-structures. Nature has
exploited these native state structures in the context of 
the work horse molecules of life. The selection mecha- 
nism for genetic evolution at the molecular level lies in 
the ability of the protein encoded by the gene to fold 
well into one of the predetermined folds and have useful 
function. Unfortunately, however, the proximity of this 
beautiful phase to the generic amyloid phase underscores 
how life can easily malfunction as soon as aggregational 
tendencies of proteins come to the fore. One cannot but 
marvel at the robustness of life. 

An imperfect analogy to the protein problem is a town- 
ship consisting of around a thousand houses (protein 
structures), each with its own distinctive style (topol- 
ogy), determined by geometry and symmetry. The form 
of a house (structure) is the basis of useful functionality. 
A person (protein) whose tastes (sequence) are (is) es- 
pecially matched to a given style of house (native state 
structure) would choose to live in it. Of course, many 
people (proteins) with similar though not identical tastes 
(sequences) might choose the same style of house (native 
state structure). If a person were to arrive in this town,
how would she/he know which house to move into? One 
way would be to explore all the house styles until the 
dream house is identified. A vastly more efficient situa- 
tion would occur if the person arrives at the township in 
the vicinity of the house that she/he will eventually oc- 
cupy. This would require that the location of the start- 
ing point (the denatured conformation) is encoded by 
the tastes of the person (the sequence) and is within the 
basin of attraction of her/his dream home (native state 
structure). This, as yet unproven, scenario would greatly 
facilitate the folding of a protein into its native state 
structure accounting for its "surprising simplicity" Q . 

The protein problem, which lies at the intersection of 
many disciplines, is highly complex. Evolution compli- 
cates the situation even further. Human design allows for 
an engineer to devise entirely new ways of accomplishing 
certain tasks - a classic example is the replacement of 
vacuum tubes with semiconductor transistors. Nature 
does not have this luxury in evolutionary design. Na- 
ture takes what she has, tinkers with it and builds on 
it. Thus the notion of optimal design is not particu- 
larly relevant and the future is very strongly correlated 
with the present and the past. A slightly different turn 
of events could have led to conspicuously different life
forms. This picture of Nature muddling along through 
evolution combined with the inherent complexity of
proteins makes the problem very daunting. Yet, within this
complexity, there is a stunning simplicity provided by
the fixed backdrop of the protein folds determined by
physical law in the context of which sequences and
functionalities are shaped by evolution.

FIG. 12: (Color online) Pauling and Ramachandran revisited:
The top row depicts the 'classic' structures of an α-helix (a)
and a pleated β-sheet (b). The main-chain backbone atoms
and the Cβ atoms of the side chain groups are shown (color
codes are different for (a) and (b)). Hydrogen bonds, which
stabilize the structures, are shown as dashed lines. In the
bottom row we show the Ramachandran plot (c) describing how
the torsional degrees of freedom (ψ, φ), the backbone dihedral
angles within an all atom representation, are constrained by
steric effects. The colored areas in the plot correspond to
allowed regions in conformational space. The structures (a) and
(b) stabilized by hydrogen bonding indeed lie squarely within
the sterically accessible regions. An example of a dipeptide
conformation disallowed because of steric hindrance is shown
in (d).

We conclude by revisiting the classic theoretical work 
of Pauling [8, 9] and Ramachandran [49]. Both of them
considered the protein backbone which is the common 
part of all proteins. Pauling and his coworkers explored 
the types of structures that are consistent with both the 
backbone geometry and the formation of hydrogen bonds. 
They predicted that helices and sheets are the structures 
of choice in this regard (Fig. 12(a,b)). Ramachandran
and his coworkers carried out their pioneering work more 
than a decade after Pauling. They considered the role of 
excluded volume or steric interactions between the adja- 
cent amino acids in reducing the available conformational 
phase space (Fig. 12(c)). Astonishingly, the two significantly
populated regions of the Ramachandran plot correspond
to the α-helix and the β-strand. Even though
hydrogen bonds and sterics are not related to each other, 
they are both promoters of helices and sheets. Is this con- 
currence of events a mere accident? The marginally com- 
pact phase of short tubes has helices and sheets as its pre- 
ferred structures. In order for Nature to take advantage 
of this phase of matter, proteins, which obey physical 
law, may have been selected to conform to the tube ge- 
ometry. Hydrogen bonds serve to enforce the parallelism 
of nearby tube segments, a feature of both helices and 
sheets while sterics emphasizes the non-zero thickness of 
the tube and serves to position it in the marginally com- 
pact phase. Because the marginally compact phase is a 
finite size effect, proteins tend to be relatively short com- 
pared to conventional macromolecules including DNA. 
Indeed, proteins seem to be a vivid example of the adap- 
tation of Nature to her own laws. 

In his insightful book, 'The Fitness of the Environ- 
ment', Henderson extended the notion of Darwinian fit- 
ness to argue that "the fitness of environment is quite 
as essential a component as the fitness which arises in 
the process of organic evolution." Strikingly, the chem- 
istry of proteins ensures that they are self-tuned to oc- 
cupy the marginally compact phase of short tubes. One 
cannot but marvel at how several factors, the steric in- 
teractions; hydrogen bonds which provide the scaffolding 
for protein structures; the constraints placed by quan- 
tum chemistry on the relative lengths of the hydrogen 
and covalent bonds and the near planarity of the peptide 
bonds; and the key role played by water all reinforce and 
conspire with each other to place proteins in this novel 
phase of matter. 

Proteins have proved to be difficult to understand be- 
cause of their inherent complexity with twenty types of 
amino acids and the role played by water, because they 
are relatively short molecules compared to generic man- 
made polymers and are therefore likely to be character- 
ized by 'non-universal' behavior, and because of the com- 
plexities associated with the random process of evolution. 
Nevertheless, our work suggests that there is an underlying
stunning simplicity. While sequences and functionalities
of proteins evolve, the folds that they adopt,
which in turn determine function, seem to be determined 
by physical law and are not subject to Darwinian evolu- 
tion. In that regard, these folds may be thought of as 
immutable or Platonic. Protein folds do not evolve - 
rather, the menu of possible folds is determined by phys- 
ical law. In that sense, it is as if evolution acts in the 
theater of life and shapes sequences and functionalities 
but does so within the fixed backdrop of the Platonic 
folds. 

Henderson [13] wrote "The properties of matter and the
course of cosmic evolution are now seen to be intimately 
related to the structure of the living being and to its 
activities; they become, therefore, far more important 
in biology than has been previously suspected. For the 
whole evolutionary process, both cosmic and organic, is 
one, and the biologist may now rightly regard the universe
in its very essence as biocentric." His intriguing
ideas continue to provoke thought even as we strive to 
understand the connections between life and the laws of 
nature.



APPENDIX A: THREE-BODY DESCRIPTION OF
A TUBE

In this appendix we will describe how a suitable three-body
potential [19] characterizes the self-avoidance of a
tube of thickness $\Delta$ whose axis, $\mathcal{C}$, is a smooth curve,
$\mathbf{r}(s)$, parametrized by its arc-length $s$ with $0 \le s \le L$,
$L$ being the total length of the tube. The tube is a one
dimensional generalization of the zero dimensional hard
sphere case as described in the text. The self-avoidance
of an ensemble of hard spheres, each of radius $\Delta$, can be
ensured by requiring that none of the distances between
all pairs of sphere centers is less than $2\Delta$.

Let us consider, first, a closed curve, i.e. $\mathbf{r}(0) = \mathbf{r}(L)$.
At each position, $s$, along the curve $\mathcal{C}$ we consider an
infinitesimally thin circular disk of radius $\Delta$, $\Sigma(s, \Delta)$,
centered at the point $\mathbf{r}(s)$ and perpendicular to the tangent
vector $d\mathbf{r}(s)/ds$ at $s$. The tube is simply the union of
all the disks. The self avoidance is imposed by requiring
that pairs of disks at different points do not intersect,

$$\Sigma(s, \Delta) \cap \Sigma(s', \Delta) = \emptyset \quad \forall\, s \neq s'.$$

There is an easier way to implement the self-avoidance
(steric constraints), which underscores the key difference
between the hard sphere and the tube problem. Indeed,
in the latter case, there are two classes of lengths which
are relevant to the steric interaction: the radius of
curvature, $|\ddot{\mathbf{r}}(s)|^{-1}$, at each position $s$ and the closest
approach distances (note that $|\dot{\mathbf{r}}(s)| = 1$ within the
arc-length parametrization). A closest approach occurs at,
say, points $\mathbf{r}(s_1)$ and $\mathbf{r}(s_2)$ ($s_1 \neq s_2$) when $\mathbf{r}(s_1) - \mathbf{r}(s_2)$ is
perpendicular to both tangent vectors at $s_1$ and $s_2$. For
a smooth closed curve there is at least one such closest
approach. It is rather intuitive [110, 111] that a
necessary and sufficient condition for the self avoidance is
that $\Delta$ be less than the minimum among $|\ddot{\mathbf{r}}(s)|^{-1}$ $\forall s$
and $\frac{1}{2}|\mathbf{r}(s_1) - \mathbf{r}(s_2)|$ $\forall s_1, s_2$ where $\dot{\mathbf{r}}(s_i)$, $i = 1, 2$, are
both perpendicular to $\mathbf{r}(s_1) - \mathbf{r}(s_2)$. This minimum is
called the thickness, $\Delta(\mathcal{C})$, of the curve $\mathcal{C}$. The
minimum among the closest approach distances is
analogous to the minimum among all distances between pairs
of centers in the hard sphere problem. The fact that the
tube is a linear object introduces another length in the
problem, which is the minimum among all radii of
curvature, and it is local in nature in the sense that it involves
nearby points of the curve $\mathcal{C}$: the radius of curvature at
position $s$ represents the radius of the circle that best
approximates the curve $\mathcal{C}$ at $s$.

We now turn to a reformulation of the tube self avoidance
constraint in a much more appealing way that makes
it more similar to the self-avoidance recipe for hard
spheres. Following ref. [20], let us consider a triplet of
positions along the curve $\mathcal{C}$ (instead of a pair of centers
as in the hard sphere problem), $\mathbf{r}_i = \mathbf{r}(s_i)$, $i = 1, 2, 3$.
These positions define a plane and hence a unique circle
through them whose radius is

$$r(\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3) = \frac{|\mathbf{r}_2 - \mathbf{r}_1|\,|\mathbf{r}_3 - \mathbf{r}_1|\,|\mathbf{r}_3 - \mathbf{r}_2|}{4 A(\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3)} \qquad \text{(A1)}$$

where $A(\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3)$ is the area of the triangle whose
vertices are $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{r}_3$. The theorem proved in reference
[20] states that

$$\Delta'(\mathcal{C}) = \min_{s_1, s_2, s_3} r(\mathbf{r}(s_1), \mathbf{r}(s_2), \mathbf{r}(s_3)) = \Delta(\mathcal{C}) \qquad \text{(A2)}$$

where the $s$'s do not need to be distinct. Indeed
it is easy to show that when $s_1, s_2, s_3 \to s$ then
$r(\mathbf{r}(s_1), \mathbf{r}(s_2), \mathbf{r}(s_3)) \to |\ddot{\mathbf{r}}(s)|^{-1}$, the radius of curvature
at $s$. Furthermore it is not difficult to show that the
search for the minima in eq. (A2) can be restricted to

$$\lim_{s_2 \to s_1} r(\mathbf{r}(s_1), \mathbf{r}(s_2), \mathbf{r}(s_3)) = r(\mathbf{r}_1, \mathbf{r}_1, \mathbf{r}_3) \qquad \text{(A3)}$$

which is the radius of the circle through the points $\mathbf{r}(s_3)$
and $\mathbf{r}(s_1)$ and tangent to the curve at the latter point.
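
The two radii just introduced are easy to evaluate numerically. The
following Python sketch is offered only as an illustration: it computes
the radius of the circle through three points from the product of the
side lengths divided by four times the triangle area, and the radius of
the circle through two points that is tangent to a given unit direction
at the first of them. The function names are hypothetical, points are
plain 3-tuples, and the tangent-circle formula
|d|^2 / (2 |d x t|), with d the chord vector, is the standard
chord-tangent relation rather than an expression taken from the text.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(x * x for x in a))


def circumradius(p1, p2, p3):
    """Radius of the circle through three points.

    Collinear (or coincident) points have zero triangle area and are
    assigned an infinite radius.
    """
    a = _norm(_sub(p2, p1))
    b = _norm(_sub(p3, p1))
    c = _norm(_sub(p3, p2))
    area = 0.5 * _norm(_cross(_sub(p2, p1), _sub(p3, p1)))
    if area == 0.0:
        return math.inf
    return a * b * c / (4.0 * area)


def tangent_circle_radius(p1, t1, p3):
    """Radius of the circle through p1 and p3 tangent to t1 at p1.

    t1 is assumed to be a unit vector. This is the limit of
    circumradius(p1, p2, p3) as p2 approaches p1 along the curve.
    """
    d = _sub(p3, p1)
    sin_term = _norm(_cross(d, t1))
    if sin_term == 0.0:
        return math.inf
    return sum(x * x for x in d) / (2.0 * sin_term)


if __name__ == "__main__":
    # Three points on a circle of radius 2 in the xy-plane.
    print(circumradius((2, 0, 0), (0, 2, 0), (-2, 0, 0)))
    # Same circle, using the tangent direction at the first point.
    print(tangent_circle_radius((2, 0, 0), (0, 1, 0), (0, 2, 0)))
```

Using the cross product for the area keeps the routine valid for points
in three dimensions and makes the collinear case explicit.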

Let us assume that the minimum in eq. (A2) is reached
at three distinct points $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$ and let us consider the
sphere of radius $\Delta'(\mathcal{C})$, eq. (A2), through them. If it is not
tangent to the curve in at least two of the points $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$
we have a contradiction. Indeed if the sphere is tangent
to the curve at one or none of the three points we can
shrink the sphere slightly still keeping three intersections
with the curve. However this is a contradiction since, due
to the definition of thickness, eq. (A2), any sphere of
radius less than $\Delta'(\mathcal{C})$ cannot intersect the curve in more
than two points. Thus say that one of the points where
the tangency occurs is $\mathbf{r}_1$. Since the circle through $\mathbf{r}_1$
and $\mathbf{r}_2$ and tangent to the curve at the former lies on the
sphere, it implies that $r(\mathbf{r}_1, \mathbf{r}_1, \mathbf{r}_2) \le \Delta'(\mathcal{C})$, which is a
contradiction unless the equality holds. This demonstrates that
the minimum in eq. (A2) is never exclusively reached at
three distinct points along the curve. The above
argument leads also to the proof of the theorem. In fact if
the other tangency point is, say, $\mathbf{r}_2$, then in addition to
$r(\mathbf{r}_1, \mathbf{r}_1, \mathbf{r}_2) = \Delta'(\mathcal{C})$, one also has $r(\mathbf{r}_2, \mathbf{r}_2, \mathbf{r}_1) = \Delta'(\mathcal{C})$.
One may immediately prove that this can occur only if
the tangent vectors $\dot{\mathbf{r}}(s_i)$, $i = 1, 2$, are perpendicular to
$\mathbf{r}(s_1) - \mathbf{r}(s_2)$. Thus $\min_{s_1, s_2, s_3} r(\mathbf{r}(s_1), \mathbf{r}(s_2), \mathbf{r}(s_3))$
captures simultaneously both the radius of curvature and the
distances of closest approach, consequently proving the
equality (A2).

The local thickness of the tube (global radius of
curvature in ref. [20]), at each $\mathbf{r}(s_1) \in \mathcal{C}$, may be defined
as

$$\Delta_{\mathbf{r}(s_1)}(\mathcal{C}) = \min_{s_2, s_3} r(\mathbf{r}(s_1), \mathbf{r}(s_2), \mathbf{r}(s_3)). \qquad \text{(A4)}$$

Of course the thickness $\Delta(\mathcal{C})$ is the minimum of
$\Delta_{\mathbf{r}(s)}(\mathcal{C})$ as $\mathbf{r}(s)$ varies on $\mathcal{C}$. Another theorem proved
in ref. [20] states that if $\mathcal{C}$ can be deformed smoothly
in order to maximize the thickness without changing the
knot type, the resulting curve, $\mathcal{C}^*$, called the "ideal shape"
of the given knot type, has $\Delta_{\mathbf{r}(s)}(\mathcal{C}^*) = \Delta(\mathcal{C}^*)$ for all
points where $|\ddot{\mathbf{r}}(s)| \neq 0$. Fig. 5 is a histogram of local
thicknesses for a sample of native protein structures. The
variation of the local thickness around the average value
of 2.7 Å is about 7%.

What we learn from the above mathematical frame- 
work is that a mere pairwise interaction does not suf- 
fice to describe the steric constraint of a tube whose 
axis is a string C |19| . This is because, in addition to 
the distance between two points on a string, one also 
needs to know the context, i.e. the local direction of the 
string in the proximity of the points themselves. Let us 
consider a three-body potential $V(r(\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3))$
characterizing the interaction between three particles on the
axis of the string in terms of the radius of the circle 
through them (notice that this potential is invariant un- 
der translation, rotation and permutation of the three 
points). V(r) could be the same as commonly used in 
the hard sphere problem, i.e. $V(r) = \infty$ when $r < \Delta$
and $V(r) = 0$ otherwise (in the hard sphere problem $r$
is half of the distance between a pair of sphere centers). 
This length scale neatly solves the contextual problem 
mentioned above. When two parts of a chain come to- 
gether, the radius of a circle passing through two of the 
particles on one side of the chain and one particle from 
the other side of the chain turns out to be a measure of 
the distance of approach of the two sides of the chain. On 
the other hand, when one considers three particles con- 
secutively along the chain, the radius of the circle passing 
through them is simply the local radius of curvature. In- 
deed when three such particles form a straight line, the 
radius goes to infinity and the three particles essentially 
become non-interacting. The straight line configuration 
is the best that the particles can do in terms of staying 
away from each other given that they are constrained to 
be neighbors along the chain. In the case of a polymer 
chain, such as a protein, a tube whose axis is a smooth 
string is clearly an approximation. One ought to introduce
a discrete curve $\{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N\}$, and the continuous
variable $s$ of the curve now becomes discrete. In
correspondence with the considerations above, one may again
define the thickness of a discrete curve, $\mathcal{C}$, as [20]

$$\Delta(\mathcal{C}) = \min_{i, j, k} r(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) \qquad \text{(A5)}$$

where now i,j and k are all distinct. For a discrete curve 
there is no guarantee that the minimum is obtained from 
among r(r;, r,, r^) with at least two of the three indices 
separated by one unit (e.g. j — i ± 1), but one can 
still distinguish between a local and a non-local contri- 
bution to the thickness, according to whether k are 
consecutive along the chain or not. In the latter case, the 
minimum obtained from among r(rj, Vj, r^) gives half the 
minimum distance of closest approach computed for the 
discrete chain. Similarly, there is no simple restriction of 
the triplets when one deals with open continuous curves 
with free ends. 
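The discrete thickness defined above can be illustrated with a short numerical sketch. The function names `circumradius` and `discrete_thickness` are hypothetical and are not taken from the original work; the sketch simply takes the minimum circumradius over all distinct triplets and reports separately the contribution of consecutive (local) and non-consecutive (non-local) triplets.

```python
import itertools
import math


def circumradius(a, b, c):
    """Radius of the circle through three 3D points (inf if they are collinear)."""
    ab = [b[k] - a[k] for k in range(3)]
    ac = [c[k] - a[k] for k in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    twice_area = math.sqrt(sum(x * x for x in cross))
    if twice_area == 0.0:
        return math.inf  # straight line: the radius goes to infinity
    # R = |ab| |bc| |ca| / (4 * area), with area = twice_area / 2
    return math.dist(a, b) * math.dist(b, c) * math.dist(c, a) / (2.0 * twice_area)


def discrete_thickness(points):
    """Return (thickness, local, non_local) for a discrete chain of 3D points.

    local:     minimum radius over triplets (i, i+1, i+2)
    non_local: minimum radius over all other distinct triplets
    """
    local = math.inf
    non_local = math.inf
    for i, j, k in itertools.combinations(range(len(points)), 3):
        radius = circumradius(points[i], points[j], points[k])
        if j == i + 1 and k == j + 1:
            local = min(local, radius)
        else:
            non_local = min(non_local, radius)
    return min(local, non_local), local, non_local
```

The loop over all triplets costs of order $N^3$ operations, which is acceptable for the short chains considered here; a production code would presumably restrict the search to nearby beads.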



APPENDIX B: TUBE VERSUS STRING AND 
BEADS MODEL 

In this appendix we summarize the main differences 
between the thick polymer model (TP) that we deal with 
in this work and the Edwards' model [112] (EM) in the
presence of a bending rigidity term (the analogue of the 
Edwards' model in the discrete case is the usual string 
and beads model). In both cases, one may add a twist 
rigidity term, which we neglect here, for simplicity. Let us 
consider the case of continuous chains. The Hamiltonian 
for the generalized EM model is 

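$$H_{EM}(\{\mathbf r\}) = \frac{1}{2}\int_0^L \dot{\mathbf r}(s)^2\,ds + \frac{\kappa_b}{2}\int_0^L \ddot{\mathbf r}(s)^2\,ds + \frac{v_2}{2}\int_0^L\!\!\int_0^L \delta(\mathbf r(s)-\mathbf r(s'))\,ds\,ds' + \frac{v_3}{6}\int_0^L\!\!\int_0^L\!\!\int_0^L \delta(\mathbf r(s)-\mathbf r(s'))\,\delta(\mathbf r(s)-\mathbf r(s''))\,ds\,ds'\,ds'' . \qquad \text{(B1)}$$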
The self-avoidance in the TP model[TT^ is given by 

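$$H_{TP}(\{\mathbf r\}) = \int_0^L\!\!\int_0^L\!\!\int_0^L V\big(R_c(\mathbf r(s), \mathbf r(s'), \mathbf r(s''))\big)\,ds\,ds'\,ds'' , \qquad \text{(B2)}$$

where $R_c(\mathbf r(s), \mathbf r(s'), \mathbf r(s''))$ is the radius of the circle
through the three points $\mathbf r(s)$, $\mathbf r(s')$, $\mathbf r(s'')$ and

$$V(r) = \begin{cases} \infty & \text{if } r < R_0 \\ -1 & \text{if } R_0 < r < R_1 \\ 0 & \text{if } R_1 < r \end{cases} \qquad \text{(B3)}$$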

Note that in the limit of a continuous chain the EM
model needs the introduction of singular potentials, in
order to deal with the fact that a two-body potential is
unable to distinguish whether two nearby beads are far
apart or not along the chain. Within the context of the
Edwards' model such singularities can then be treated
successfully within a perturbative renormalization group
approach [lXJ|. On the other hand, the need for singular
potentials is deftly avoided when using the three-body
prescription implied by the thickness constraint in the
TP model [113].

The details of the discretization scheme matter for a 
discrete chain. First, the discretization introduces a natural cut-off length scale. Second, the three-body potential
$V$ of Equation (B2) cannot be used by itself for a discrete chain. Indeed, in the absence of a two-body repulsion,
the chain would collapse onto a circle of radius
between $R_0$ and $R_1$ and would wind repeatedly along it.

High temperature phase. It is well known that in
the high temperature regime the critical behavior of the
EM in the limit of very long chains is governed by the exponent $\nu \approx 0.58$, so that a typical length $\xi$ measuring the
spatial extension of the chain scales as $\xi \sim L^{\nu}$, where $L$
is the chain length. The chain is swollen with respect to
the Gaussian random walk behavior for which $\nu = 1/2$.
The same feature holds for the TP; in the high tempera- 
ture regime the different symmetry properties induced by 
the inherent anisotropy of a thick tube are averaged out, 
and a chain of coins shares the same critical behavior as 
a chain of spheres. 

Interestingly, other features such as the form of the 
two-point tangent-tangent correlation function along the 
chain differentiate the TP from the EM. In the absence 
of twisting rigidity (the intrinsic twist of the chain, as 
defined by the torsion of the corresponding curve for the 
EM or the axis of the tube for the TP, is described by
the energy term $\frac{\kappa_t}{2}\int_0^L \dot{\mathbf b}(s)^2\,ds$, where $\mathbf b(s)$ is the binormal vector which is part of the Frenet triad), one gets a
simple exponential decay in both cases. However when
the twisting rigidity $\kappa_t$ is introduced, the EM exhibits an
oscillatory decaying correlation function for any value of
$\kappa_t$, whereas the TP crosses over from simple to oscillatory
decay on increasing $\kappa_t$ (see Appendix F) [116]. The existence of a transition line in the parameter space $(\kappa_b, \kappa_t)$
separating simple from oscillatory decay is a novel feature 
of the TP model. 

Persistence length. Another similarity between the
TP and the EM is the following: at any fixed value of
the length $L$ of the chain and of the temperature $T$,
the persistence length $l_p$ (which is a measure of the distance along the chain after which the tangent vectors become uncorrelated) diverges both for the TP, in the limit
$\Delta \to \infty$ of infinite thickness, and for the EM, in the limit
$\kappa_b \to \infty$ of infinite bending rigidity. The thickness constraint indeed stiffens the chain locally. Yet, a closer look
reveals an important difference between the TP and the
EM model. In an ideal case in which non-local interactions are disregarded we get a different scaling behavior.
For the EM, $l_p \sim \kappa_b / k_B T$, whereas in the TP the persistence length does not increase at low temperatures, a
FIG. 13: Schematic phase diagram of an EM model in the
temperature ($T$), bending rigidity ($\kappa_b$) plane [117]. The different phases are S=Swollen, G=globule, and AG=Asymmetric
globule.



FIG. 14: Phase diagram for a thick polymer chain in the
temperature-thickness ($T$, $\Delta$) plane obtained with Monte-Carlo simulations (see [22] for details). The different phases
are S=Swollen, G=globule, and AG=Asymmetric globule.



first hint that the low temperature behavior of the TP
may be radically different from that of the EM.

Low temperature phase. The anisotropy inherent in
the thick tube description strongly affects the behavior
of the TP model at low temperatures, as can be seen
by comparing the bending rigidity/temperature phase
diagram (Figure 13) for the EM and the corresponding
thickness/temperature phase diagram (Figure 14) for the
TP in the thermodynamic limit.

Let us first note that whereas the swollen (S) and 
the (disordered compact) globule (G) phase share sim- 
ilar features in the two cases (but see the above dis- 
cussion concerning the correlation function properties in 
the swollen phase), the asymmetric globule (AG) (semi- 
crystalline) phase is different. The persistence length di- 
verges (strictly at T = 0) with the chain length for both 
EM and TP (the chain is locally straight). For the latter, 
this arises from the interplay of the thickness constraint 
and the interaction promoting compaction, so that the 
resulting ground state conformation will have tube seg- 
ments aligned with respect to one another similar to the 
Abrikosov flux lattice, filling the space with hexagonal 
symmetry. For the former, this is a mere consequence of 
the local bending rigidity, so that ground state conforma- 
tions will likely consist of planes stacked onto each other 
with parallel (or antiparallel) alignment within the same
plane, but not necessarily between different planes. 

A second crucial difference is that in the limit of zero
temperature the EM is in the AG phase for all finite values
of the bending rigidity, whereas the TP exhibits a transition from the AG phase to the swollen phase with increasing thickness. This has profound consequences, especially
when finite size effects are taken into account. It is instructive to revisit the phase diagram at $T = 0$ for a TP in
the plane $(L/R, \Delta/R)$ (see Fig. 2). If $\Delta > R$, the chain
cannot avail itself of the attraction, the length of the chain
does not play any role and in the $L \to \infty$ limit one gets
the critical behavior of the swollen phase. When $\Delta < R$,
in the thermodynamic limit $L \to \infty$, the chain is in the
asymmetric globule phase resembling the Abrikosov flux
lattice with hexagonal symmetry, but novel phenomena
occur for finite chain length. If $L < 2\pi R$ all parts of
the chain are able to interact with each other. When
$L > 2\pi R$, this is still true for small enough thickness.
However, as the thickness increases, this is no longer possible and the chain adopts a conformation which
optimizes the attractive interaction. In the long chain
limit the boundary line between the two regimes scales as
$L/R \sim (R/\Delta)^2$. This result is obtained by equating the
volume occupied by the tube, $L\Delta^2$, to the volume of the
sphere of attraction, $R^3$. For shorter chains the compact
regime at intermediate thickness in which the chain seeks
to compact itself within the constraint of the thickness
is indeed marginal, being sandwiched between the featureless compact [118] and the swollen regime described
above. It is precisely in this window of parameter space
that we find marginally compact ground state structures
such as space-filling helices. This finite-size feature of the
TP model is quite robust, independent of the details introduced, for instance, in the discrete case. None of these
features are present for the EM case for which there is no
dependence whatsoever on the bending rigidity at $T = 0$.

APPENDIX C: OPTIMAL HELIX 

In this Appendix we derive the value c* of the 
pitch/radius ratio c of an optimal space-filling helix. The 
radius of curvature of such a helix equals the tube radius 
and is equal to half the minimum distance of closest ap- 
proach between different turns of the helix. 

The parametric equation of a helix is

$$\mathbf x(t) = (r\cos t,\; r\sin t,\; vt) , \qquad \text{(C1)}$$

where the pitch/radius ratio is $c = 2\pi v/r$. The tangent and
the acceleration vectors are:

$$\dot{\mathbf x}(t) = (-r\sin t,\; r\cos t,\; v) , \qquad \text{(C2)}$$

$$\ddot{\mathbf x}(t) = (-r\cos t,\; -r\sin t,\; 0) . \qquad \text{(C3)}$$
Since $\dot{\mathbf x}(t)\cdot\ddot{\mathbf x}(t) = 0$, the radius of curvature is simply
given by

$$\rho_L = \frac{|\dot{\mathbf x}(t)|^2}{|\ddot{\mathbf x}(t)|} = \frac{v^2 + r^2}{r} , \qquad \text{(C4)}$$

independently of $t$.

We define the non-local radius of curvature as half the 
distance of closest approach between successive turns of 
the helix. Fix a point $A = \mathbf x(t)$ on the curve, and compute the distance $d(s,t) = |B - A|$ from a second point
$B = \mathbf x(s)$ moving along the curve as a function of $s$. The
non-local radius is then

$$\rho_{NL}(t) = \frac{1}{2}\min_{s}\{d(s,t)\} , \qquad \text{(C5)}$$

with the requirement that $\partial d(s,t)/\partial s = 0$ at some $s^* \neq t$, implying that $B - A$ is perpendicular to the tangent vector
$\dot{\mathbf x}(s^*)$. Note that the non-local radius need not exist (for
open curves) and is in principle a varying function of $t$,
when the curve is not invariant under translation along
it.

Because the helix is invariant under translation along
the curve, implying that $B - A$ is perpendicular also to
the tangent vector $\dot{\mathbf x}(t)$, we can choose $t = 0$ so that

$$d^2(s) = d^2(s,0) = r^2\left[2(1-\cos s) + \frac{c^2}{4\pi^2}\,s^2\right] . \qquad \text{(C6)}$$



The condition for extremal points of $d^2(s)$ is

$$\sin s + \frac{c^2}{4\pi^2}\,s = 0 . \qquad \text{(C7)}$$



One trivial solution of this equation is $s = 0$ and there is
no other solution for sufficiently high pitch to radius ratio
$c = 2\pi v/r$. If $c$ is decreased, new solutions appear, two at a
time, the smaller a maximum and the greater a minimum,
corresponding to the increasing packing of helix turns.
We are interested in the minimum $s^*$ corresponding to
$A$ and $B$ staying on two consecutive turns, that is $\pi < s^* < 2\pi$. For sufficiently low $c$, the above equation then
defines the implicit function $s^*(c)$, and one has

$$\rho_{NL} = \frac{1}{2}\,d[s^*(c)] \qquad \text{(C8)}$$



for all points of the helix. In the limit $c \ll 1$, one has
$s^* \simeq 2\pi$ and thus $\rho_{NL} \simeq p/2$, where $p = 2\pi v$ is
the pitch of the helix, as expected. The particular value
$c = c^*$, for which the local and the non-local radius of
curvature are equal, is then defined by

$$\frac{1}{2r}\,d[s^*(c^*)] = 1 + \frac{c^{*2}}{4\pi^2} . \qquad \text{(C9)}$$



Thus, according to the definition of thickness given in
Appendix A, $\Delta_{helix} = \rho_L$ if $c > c^*$, since the radius of
curvature is smaller than the non-local radius. A tube
swelling around the helix would stop increasing due to
local singularities, leaving space between the successive
turns of the helix. On the other hand, the non-local radius is smaller than the radius of curvature if $c < c^*$,
implying $\Delta_{helix} = d(s^*(c))/2$. In such a case, the tube
would stop swelling due to self-intersection between different turns, leaving a hole in the middle of the helix. At
$c = c^*$, one obtains an optimal space-filling helix with a
special pitch to radius ratio of $c^* \approx 2.512$ (shown in Fig.
0. 
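The condition that fixes $c^*$ can be checked numerically. The sketch below is an illustration rather than the procedure used by the authors: it sets the helix radius to $r = 1$, writes $v = c/2\pi$, locates the minimum $s^*$ of $d^2(s)$ on the next turn by bisection, and then bisects on $c$ until the local and non-local radii coincide. The function names and the bracketing intervals (1.0 to 2.8 for $c$, $3\pi/2$ to $2\pi$ for $s^*$) are assumptions of this sketch.

```python
import math


def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def local_radius(c):
    """Radius of curvature of a unit-radius helix: (v^2 + r^2) / r with r = 1."""
    return 1.0 + (c / (2.0 * math.pi)) ** 2


def non_local_radius(c):
    """Half the closest approach between successive turns (unit-radius helix)."""
    eta2 = (c / (2.0 * math.pi)) ** 2
    # Extremum condition of d^2(s): sin(s) + eta2 * s = 0. The minimum on the
    # next turn lies in (3 pi / 2, 2 pi) as long as eta2 < 2 / (3 pi).
    s_star = bisect(lambda s: math.sin(s) + eta2 * s, 1.5 * math.pi, 2.0 * math.pi)
    d_squared = 2.0 * (1.0 - math.cos(s_star)) + eta2 * s_star ** 2
    return 0.5 * math.sqrt(d_squared)


def optimal_pitch_ratio():
    """Pitch/radius ratio at which the local and non-local radii coincide."""
    return bisect(lambda c: local_radius(c) - non_local_radius(c), 1.0, 2.8)


if __name__ == "__main__":
    print(optimal_pitch_ratio())
```

Under these assumptions the root should come out close to the value of 2.512 quoted in the text, with the local radius exceeding the non-local one below the root and the reverse above it.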



APPENDIX D: GEOMETRICAL CONSTRAINTS 
DETERMINED FROM EXPERIMENTAL DATA 



In this Appendix we describe the data analysis used
to elucidate the geometrical constraints imposed by sterics and hydrogen bonds (see Fig. 15). We have used
a database of 600 different protein native structures [119]
consisting of sequences varying in length from 44 to 1017,
with low sequence homology and covering many different three-dimensional folds according to the Structural
Classification of Proteins (SCOP) scheme [120]. Panel (a)
depicts the histogram of the local radius of curvature associated with two classes of triplets, the first (shown in
red) featuring strong α-helix forming amino acids (LEU,
ALA, GLU) and the second (shown in blue) featuring
β-strand formers (VAL, ILE, TYR)[13( and underscores
the vital role of chemistry in choosing from among the
menu of native state folds. The vertical dashed line indicates the threshold length scale chosen in the model
for the curvature energy penalty. The remaining panels
show histograms for several quantities involved in the definition of hydrogen bonds: the $C^\alpha$-$C^\alpha$ distance between
$i$, $i+3$ atoms given that $i$, $i+1$, $i+2$, $i+3$ all belong
to a helix (Panel (b)) and between $i$, $j$ (with $j > i + 4$)
atoms given that $i$, $j$ belong to a β-strand (Panel (c));
the scalar products $\mathbf b_i \cdot \mathbf b_j$ (Panel (d)) and $(\mathbf b_i + \mathbf b_j)\cdot\hat{\mathbf r}_{ij}/2$
(Panel (e)) for $i$, $j$ contacts (with $|j - i| = 3$ (red) and
with $|j - i| > 4$ (blue) provided that no closer inter-strand contact is present among $i \pm 1$, $j \pm 1$) ($\mathbf b_i$ is the
binormal vector at atom $i$ and $\hat{\mathbf r}_{ij}$ is the vector joining
atoms $i$ and $j$ normalized to unit length); and the scalar
product $\mathbf b_i \cdot \mathbf b_{i+1}$ for consecutive residues along a β-strand
(Panel (f)). In each case, the dashed lines and arrows
depict the approximate constraints used in our model.
All histograms are normalized in such a way that a flat
distribution would have a constant unit height.



APPENDIX E: DETAILS OF MODEL AND 
MONTE-CARLO SIMULATIONS 



The protein backbone is modeled as a chain of $C^\alpha$
atoms with a fixed distance of 3.8 Å between successive
atoms along the chain, an excellent assumption for all but
cis-proline amino acids [13]. The geometry imposed
FIG. 15: (Color online) Statistical analysis of several quantities computed for residues classified as participating in secondary structures in protein native state structures from the
Protein Data Bank. Red (blue) histograms refer to residues
participating in α-helices (β-strands).



by chemistry dictates that the bond angle associated with
three consecutive $C^\alpha$ atoms is between 82° and 148°.

Tube geometry. Self-avoiding conformations of the
tube whose axis is the protein backbone are identified
by considering all triplets of $C^\alpha$ atoms and drawing
circles through them and ensuring that none of their
radii is smaller than the tube radius. At the local level, the three-body constraint ensures that a flexible tube cannot have a radius of curvature any smaller
than the tube thickness in order to prevent sharp corners
whereas, at the non-local level, it does not permit any
self-intersections. The backbone of $C^\alpha$ atoms is treated
as a flexible tube of radius 2.5 Å, a constraint imposed on
all (local and non-local) three-body radii, an assumption
validated for protein native structures [48].

Sterics. Steric constraints require that no two non-adjacent $C^\alpha$ atoms are allowed to be at a distance closer
than 4 Å. Ramachandran and Sasisekharan^^ showed
that steric considerations based on a hard sphere model
lead to clustering of the backbone dihedral angles in two
distinct α and β regions for non-glycyl and non-prolyl
residues. The two backbone geometries that allow for
systematic and extensive hydrogen bonding^, U are
the α-helix and the β-sheet obtained by a repetition
of the backbone dihedral angles from the two regions
respectively [43]. Short chains rich in alanine residues,
which are a good approximation to a stretch of the backbone, can adopt a helical conformation in water (see [121]
for a detailed discussion of experimental conditions necessary to achieve this). However, when one has more
heterogeneous side chains, the helix backbone could sterically clash with some side-chain conformers resulting in
a loss of conformational entropy [122]. When the price
in side-chain entropy is too large, an extended backbone
conformation results pushing the segment towards a β-strand structure 0. These steric constraints are approximately imposed through an energy penalty (denoted by
$e_R$) when the local radius of curvature is between 2.5
Å and 3.2 Å. (The magnitude of the penalty does not
depend on the specific value of the radius of curvature
provided it is between these values.) There is no cost
when the local radius exceeds 3.2 Å. Note that the tube
constraint does not permit any local radius of curvature
to take on a value less than the tube radius, 2.5 Å.
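The curvature rule described above amounts to a step function of the local radius. A minimal sketch follows; the function name `curvature_energy`, the argument name `penalty`, and the treatment of the boundary values (a radius of exactly 3.2 Å is penalized, a radius of exactly 2.5 Å is allowed) are assumptions of this illustration, since the text does not specify them.

```python
import math

TUBE_RADIUS = 2.5     # angstrom: no three-body radius may be smaller than this
PENALTY_RADIUS = 3.2  # angstrom: local radii above this value carry no cost


def curvature_energy(local_radius, penalty):
    """Energy cost associated with one local (consecutive-triplet) radius.

    local_radius: radius of the circle through three consecutive C-alpha atoms
    penalty:      magnitude of the curvature energy penalty (a model parameter)
    """
    if local_radius < TUBE_RADIUS:
        return math.inf  # forbidden by the tube constraint
    if local_radius <= PENALTY_RADIUS:
        return penalty   # flat penalty, independent of the exact radius
    return 0.0
```

Returning an infinite energy for forbidden radii lets the same function be used inside a Metropolis test, where such conformations are then always rejected.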

Hydrogen bonds. We do not allow more than two
hydrogen bonds to form at a given $C^\alpha$ location. In
our representation of the protein backbone, local hydrogen bonds form between $C^\alpha$ atoms separated by three
along the sequence with an energy defined to be $-1$ unit,
whereas non-local hydrogen bonds are those that form
between $C^\alpha$ atoms separated by more than 4 along the sequence with an energy of $-0.7$. This energy difference is
based on experimental findings that the local bonds provide more stability to a protein than do the non-local hydrogen bonds [123]. Cooperativity effects [124] are taken
into account by adding an energy of $-0.3$ units when consecutive hydrogen bonds along the sequence are formed.
There is some latitude in the choice of the values of these
energy parameters. The results that we present are robust to changes (at least of the order of 20%) in these
parameters.

Geometrical constraints due to hydrogen bonding. For hydrogen bond formation between atoms $i$ and
$j$, the distance between these atoms ought to be between
4.7 Å and 5.6 Å (4.1 Å and 5.3 Å) for the local (non-local) case (see Fig. 15(b) for the local case). A study of
protein native state structures reveals an overall nearly
parallel alignment of the axes defined by three vectors:
the binormal vectors at $i$ and $j$ and the vector $\mathbf r_{ij}$ joining the $i$ and $j$ atoms. A hydrogen bond is allowed to
form only when the binormal axes are
within 37° of each other, whereas the angle between the
binormal axes and that defined by $\mathbf r_{ij}$ ought to be less
than 20° (see Fig. 15(c)). Additionally, for the cooperative formation of non-local hydrogen bonds, one requires
that the corresponding binormal vectors of successive $C^\alpha$
atoms make an angle greater than 90° (see Fig. 15(d)).
The first and the last residues of the chain are special
cases since their binormal vectors are not defined. In
order for such residues to form a hydrogen bond (with
each other or with other internal residues in the chain),
it is required that the angle between the associated ending peptide link and the connecting vector to the other
residue participating in the hydrogen bond is between
70° and 110°. As in real protein structures, when helices are formed, they are constrained to be right-handed.
This is enforced by requiring that the backbone chirality associated with each local hydrogen bond is positive.
The chirality is defined as the sign of the scalar product
$(\mathbf r_{i,i+1} \times \mathbf r_{i+1,i+2}) \cdot \mathbf r_{i+2,i+3}$ |l2||.
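The chirality test can be sketched in a few lines. The sketch assumes that $\mathbf r_{i,j}$ denotes the vector pointing from atom $i$ to atom $j$; the function names are hypothetical and are not taken from the original work.

```python
def _sub(p, q):
    """Vector from point q to point p."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def chirality(r0, r1, r2, r3):
    """Sign of (r_{i,i+1} x r_{i+1,i+2}) . r_{i+2,i+3} for atoms i, ..., i+3.

    Returns +1, -1, or 0 (for a planar arrangement of the four atoms).
    """
    u = _sub(r1, r0)
    v = _sub(r2, r1)
    w = _sub(r3, r2)
    triple = _dot(_cross(u, v), w)
    return (triple > 0) - (triple < 0)
```

With this convention, four consecutive points on a helix that winds counterclockwise while advancing along its axis give a positive sign, and its mirror image gives a negative one.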

Hydrophobic interactions. The hydrophobic (hydrophilic) effects mediated by the water are captured
through a relatively weak interaction, $e_W$, (either attractive or repulsive) between $C^\alpha$ atoms which are within
7.5 Å of each other. Note that hydrogen bonds can easily be formed between the amino acid residues in an extended conformation and the water molecules. Within
our model, the intrachain hydrogen bond interaction introduces an effective attraction, because water molecules
are not explicitly present. The hydrophobicity scale is
thus renormalized (e.g. even when $e_W$ is weakly positive,
there could be an effective attraction resulting in structured conformations such as a single helix or a planar
sheet). A negative $e_W$ is, in any case, crucial for promoting the assembly of secondary motifs in native tertiary
arrangements.

Monte Carlo simulations are carried out with pivot and 
crankshaft moves commonly used in stochastic chain dynamics [126]. A Metropolis procedure is employed with
a thermal weight $\exp(-E/T)$, where $E$ is the energy of
the conformation and T is the effective temperature. 
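The acceptance step of such a Metropolis procedure can be sketched as follows. This is the generic textbook criterion for the thermal weight $\exp(-E/T)$, not code from the original work; the generation of pivot and crankshaft moves is not shown, and the function name and the `rng` argument are assumptions of this illustration.

```python
import math
import random


def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Decide whether to accept a trial move from energy e_old to e_new.

    Moves that lower the energy are always accepted; moves that raise it are
    accepted with probability exp(-(e_new - e_old) / temperature). Trial
    conformations with infinite energy (violated hard constraints) are rejected.
    """
    if math.isinf(e_new):
        return False
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / temperature)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing simulations at different effective temperatures.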



APPENDIX F: CORRELATION FUNCTIONS IN 
THE DENATURED STATE 



We will consider a polypeptide chain in a phase where
the local interactions dominate the behavior of the correlation functions, to be studied below, at least at short
and intermediate distances along the chain. We thus neglect the steric interactions apart from the effect that
they have on neighboring nodes of the chain. The correlation functions we will consider involve the unit vectors
$\mathbf t_i$ parallel to $\mathbf r_{i+1} - \mathbf r_i$ and the binormal $\mathbf b_i = (\mathbf t_i \times \mathbf t_{i-1})/|\mathbf t_i \times \mathbf t_{i-1}|$. Note that, in order to facilitate the calculations, our definition for the tangent vector is different
from the one used in Fig. Ej in Section IV. The geometrical constraints of hydrogen bond formation are associated
with the binormal vector, whose definition is unchanged:
the binormal vector is perpendicular to the plane defined
by $\mathbf r_{i-1}$, $\mathbf r_i$, $\mathbf r_{i+1}$. Let $\theta_i \in (0, \pi)$ be the angle between $\mathbf t_i$
and $\mathbf t_{i-1}$, and $\phi_i \in (-\pi, \pi)$ the angle by which the plane
defined by $\mathbf r_{i-1}$, $\mathbf r_i$ and $\mathbf r_{i+1}$ is rotated about the axis $\mathbf t_{i-1}$
with respect to the plane defined by $\mathbf r_{i-2}$, $\mathbf r_{i-1}$ and $\mathbf r_i$.
Quite generally the joint probability distribution of angles, $\mathcal P(\theta_2, \theta_3, \phi_3, \ldots)$, will depend on the entire ensemble of interactions including the steric interactions. However in the phase we wish to study we will assume that
this probability distribution can be factorized, i.e. we will
consider the case where we have probability distributions
$p_i(\theta_i, \phi_i)$ for each pair of angles $\theta_i$, $\phi_i$ with $i = 3, 4, \ldots$ and

$$\mathcal P(\theta_2, \theta_3, \phi_3, \theta_4, \phi_4, \ldots) = p_2(\theta_2) \prod_{i \geq 3} p_i(\theta_i, \phi_i) , \qquad \text{(F1)}$$

where the contribution for the angle $\theta_2$ between the
first two vectors of the chain, $\mathbf t_1$ and $\mathbf t_2$, has been singled
out. The average with respect to $\mathcal P$ will be written as
$\langle\cdot\rangle_{\mathcal P}$ whereas the average with respect to $p_i(\theta, \phi)$ will be
denoted simply as $\langle\cdot\rangle_i$. In the case of a protein sequence
the $p_i(\theta_i, \phi_i)$ depends explicitly on the type of amino
acids in the neighborhood of the $i$-th position. It is this
dependence that ultimately will determine the propensity
of a given segment of the protein sequence to be in a given
secondary structure. One can straightforwardly derive
the following recursion relations:

$$\mathbf t_i = -\mathbf t_{i-2}\,\frac{\sin\theta_i\cos\phi_i}{\sin\theta_{i-1}} + \mathbf t_{i-1}\,(\cos\theta_i + \sin\theta_i\cos\phi_i\cot\theta_{i-1}) + \mathbf b_{i-1}\,\sin\theta_i\sin\phi_i , \qquad \text{(F2)}$$

$$\mathbf b_i = \mathbf t_{i-2}\,\frac{\sin\phi_i}{\sin\theta_{i-1}} - \mathbf t_{i-1}\,\cot\theta_{i-1}\sin\phi_i + \mathbf b_{i-1}\,\cos\phi_i . \qquad \text{(F3)}$$

If one wishes to calculate the correlation function
$\langle \mathbf x\cdot\mathbf t_i\rangle_{\mathcal P}$, where $\mathbf x$ is $\mathbf t_2$, $\cot\theta_2\,\mathbf t_2$, $\mathbf b_2$ or any other combination of them, then one needs to introduce other correlation functions in order to have a closed set of
recursion equations. By defining the vector

$$\mathbf V_i = \begin{pmatrix} \langle \mathbf x\cdot\mathbf t_i\rangle_{\mathcal P} \\ \langle \mathbf x\cdot\mathbf t_{i-1}\rangle_{\mathcal P} \\ \langle \mathbf x\cdot\cot\theta_i\,\mathbf t_i\rangle_{\mathcal P} \\ \langle \mathbf x\cdot\mathbf b_i\rangle_{\mathcal P} \end{pmatrix} \qquad \text{(F4)}$$

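the recursion equations can be written in a compact
form in terms of the vectors $\mathbf V$'s and the transfer matrix
$\mathcal T_i$,

$$\mathbf V_i = \mathcal T_i \mathbf V_{i-1} , \qquad \text{(F5)}$$

where the non-zero matrix elements of $\mathcal T_i$ are

$$\begin{aligned}
t_{i;1,1} &= \langle\cos\theta\rangle_i , & t_{i;1,2} &= -\left\langle\frac{1}{\sin\theta}\right\rangle_{i-1}\langle\sin\theta\cos\phi\rangle_i , \\
t_{i;1,3} &= \langle\sin\theta\cos\phi\rangle_i , & t_{i;1,4} &= \langle\sin\theta\sin\phi\rangle_i , \\
t_{i;2,1} &= 1 , \\
t_{i;3,1} &= \langle\cot\theta\cos\theta\rangle_i , & t_{i;3,2} &= -\left\langle\frac{1}{\sin\theta}\right\rangle_{i-1}\langle\cos\theta\cos\phi\rangle_i , \\
t_{i;3,3} &= \langle\cos\theta\cos\phi\rangle_i , & t_{i;3,4} &= \langle\cos\theta\sin\phi\rangle_i , \\
t_{i;4,2} &= \left\langle\frac{1}{\sin\theta}\right\rangle_{i-1}\langle\sin\phi\rangle_i , \\
t_{i;4,3} &= -\langle\sin\phi\rangle_i , & t_{i;4,4} &= \langle\cos\phi\rangle_i .
\end{aligned} \qquad \text{(F6)}$$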

Thus given the initial condition $\mathbf V_2$, which depends
only on $p_2(\theta)$, all successive $\mathbf V$'s can be calculated recursively using eq. (F5). Let us discuss the case of a uniform
stretch where $p_i(\theta, \phi)$, and thence $\mathcal T_i$, does not depend
on $i$ (the sub-indices $i$'s will be omitted in this case). If
the left and right eigenvectors of $\mathcal T$, $\mathbf W_\mu^L$ and $\mathbf W_\mu^R$ respectively, corresponding to the eigenvalue $\lambda_\mu$, form a
complete basis set the general solution of eq. (F5) can be
written as

$$\mathbf V_n = \sum_{\mu=1}^{4} \lambda_\mu^{\,n-2}\,(\mathbf W_\mu^L\cdot\mathbf V_2)\,\mathbf W_\mu^R . \qquad \text{(F7)}$$

If all the eigenvalues are real and positive and if $\lambda = \max_{\mu=1,\ldots,4}\{\lambda_\mu\}$ then for large $n$

$$\langle \mathbf t_2\cdot\mathbf t_{n+2}\rangle \sim \lambda^n , \qquad \text{(F8)}$$

and likewise for $\langle \mathbf b_2\cdot\mathbf b_n\rangle$. On physical grounds, we expect
that $\lambda < 1$, so that the correlation functions decay exponentially with the distance measured along the chain.
However it is quite common that some eigenvalues are
complex. Since the matrix $\mathcal T$ is real, complex eigenvalues occur in pairs of complex conjugate values. If the pair
$\lambda_\pm = \exp(\pm i\chi - 1/\xi)$ ($\chi$ and $\xi$ are both real and positive)
corresponds to the maximum modulus eigenvalue, then
at large $n$ we get, for example, for the tangent-tangent
correlation,

$$\langle \mathbf t_2\cdot\mathbf t_{n+2}\rangle \sim \cos(\chi_0 + n\chi)\,e^{-n/\xi} , \qquad \text{(F9)}$$



where $\chi_0$ depends on the initial conditions. Thus there
is still an exponential decay with a correlation length $\xi$
(in units of chain bond length), but there is also an oscillatory modulation with another length scale, $1/\chi$, which
corresponds to short range order along the chain (notice that, in one dimensional systems such as our chain,
long range order cannot occur if the interactions are short
range as in the present case). This type of behavior, with
$1/\chi \approx 3.6$, would be expected on a stretch of chain that
adopts a helical conformation with 3.6 amino acids per
turn.

We end this appendix with an example of such behavior which can be worked out in full detail. If
$p_i(\theta, \phi) = p_i(\theta, -\phi)$ (which corresponds to invariance under chirality flipping) and $p_i$ is independent of
$i$ for $i > 2$, then $t_{k,4} = t_{4,k} = 0$ with $k \neq 4$. This implies that the matrix $\mathcal T$ becomes block diagonal with an
eigenvalue equal to $\langle\cos\phi\rangle$ and $\langle \mathbf x\cdot\mathbf b_n\rangle$ decays exponentially with a correlation length $-1/\ln\langle\cos\phi\rangle$. Furthermore, since $t_{1,2}t_{3,3} = t_{1,3}t_{3,2}$, one eigenvalue is zero and
the remaining two are the solutions of the second order
equation $\lambda^2 + b\lambda + c = 0$ with

$$b = -t_{1,1} - t_{3,3} , \qquad c = t_{1,1}t_{3,3} - t_{1,3}t_{3,1} - t_{1,2} . \qquad \text{(F10)}$$



Thus, if $b^2 - 4c > 0$, the two solutions are real and the
tangent-tangent correlation decays exponentially to zero.
On the other hand, if $b^2 - 4c < 0$, the two solutions are
complex conjugates of each other, as described above, and
in the particular case we are considering, i.e. $p_i(\theta, \phi) = p_i(\theta, -\phi)$, one finds for all $n$, that

$$\langle \mathbf t_2\cdot\mathbf t_{n+2}\rangle = \cos(\chi_0 + n\chi)\,e^{-n/\xi} , \qquad \text{(F11)}$$

where $\xi = -2/\ln c$, $\chi = \arccos(-b/2\sqrt{c})$, and $\chi_0$ depends on the initial conditions.
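The relation between the coefficients of the quadratic equation and the two length scales can be made concrete with a small sketch. It only evaluates the closed-form expressions for the correlation length and the oscillation wavenumber quoted above; the function name and the error handling are assumptions of this illustration.

```python
import math


def oscillatory_decay_parameters(b, c):
    """Return (xi, chi) for the complex roots of lambda^2 + b*lambda + c = 0.

    The roots have modulus sqrt(c) = exp(-1 / xi) and phase +/- chi, so that
    xi = -2 / ln(c) and chi = arccos(-b / (2 sqrt(c))).
    """
    if b * b - 4.0 * c >= 0.0:
        raise ValueError("roots are real: simple exponential decay, no oscillation")
    if not 0.0 < c < 1.0:
        raise ValueError("expected 0 < c < 1 for a decaying correlation function")
    xi = -2.0 / math.log(c)
    chi = math.acos(-b / (2.0 * math.sqrt(c)))
    return xi, chi
```

For example, a pair of eigenvalues with modulus 0.9 and phase 0.5 corresponds to `b = -2 * 0.9 * math.cos(0.5)` and `c = 0.81`, for which the function returns a wavenumber of 0.5 and a correlation length of $-1/\ln 0.9$ bond lengths.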